2025-12-04T08:53:38.5385011Z Current runner version: '2.329.0'
2025-12-04T08:53:38.5387945Z Runner name: 'linux.rocm.gpu.gfx942.1.b-gwk9b-runner-vcbrh'
2025-12-04T08:53:38.5388342Z Runner group name: 'default'
2025-12-04T08:53:38.5388772Z Machine name: 'linux'
2025-12-04T08:53:38.5389889Z ##[group]GITHUB_TOKEN Permissions
2025-12-04T08:53:38.5390913Z Contents: read
2025-12-04T08:53:38.5391160Z Metadata: read
2025-12-04T08:53:38.5391415Z ##[endgroup]
2025-12-04T08:53:38.5392412Z Secret source: Actions
2025-12-04T08:53:38.5392707Z Prepare workflow directory
2025-12-04T08:53:38.5626116Z Prepare all required actions
2025-12-04T08:53:38.5645482Z Getting action download info
2025-12-04T08:53:39.1583844Z Download action repository 'pytorch/pytorch@main' (SHA:ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32)
2025-12-04T08:53:43.4016851Z Download action repository 'pytorch/test-infra@main' (SHA:39aa74d619174326f4e2fb0e216151c2f29d9ffd)
2025-12-04T08:53:48.1212019Z Download action repository 'actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02)
2025-12-04T08:53:49.3006765Z Download action repository 'aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722' (SHA:ececac1a45f3b08a01d2dd070d28d111c5fe6722)
2025-12-04T08:53:50.4373929Z Getting action download info
2025-12-04T08:53:50.6338939Z Download action repository 'actions/checkout@v4' (SHA:34e114876b0b11c390a56381ad16ebd13914f8d5)
2025-12-04T08:53:51.5133339Z Getting action download info
2025-12-04T08:53:51.7300005Z Download action repository 'nick-fields/retry@v3.0.0' (SHA:7152eba30c6575329ac0576536151aca5a72780e)
2025-12-04T08:53:52.8555728Z Getting action download info
2025-12-04T08:53:53.0725121Z Uses: pytorch/pytorch/.github/workflows/_rocm-test.yml@refs/heads/main (ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32)
2025-12-04T08:53:53.0731082Z ##[group] Inputs
2025-12-04T08:53:53.0731537Z build-environment: linux-jammy-rocm-py3.10
2025-12-04T08:53:53.0743962Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}]}
2025-12-04T08:53:53.0755006Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a
2025-12-04T08:53:53.0756000Z sync-tag:
2025-12-04T08:53:53.0757292Z timeout-minutes: 300
2025-12-04T08:53:53.0757645Z tests-to-include:
2025-12-04T08:53:53.0757958Z dashboard-tag:
2025-12-04T08:53:53.0758636Z disable-monitor: true
2025-12-04T08:53:53.0759001Z monitor-log-interval: 5
2025-12-04T08:53:53.0759378Z monitor-data-collect-interval: 1
2025-12-04T08:53:53.0759779Z ##[endgroup]
2025-12-04T08:53:53.0760448Z Complete job name: linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable)
2025-12-04T08:53:53.1132574Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main
2025-12-04T08:53:53.1132967Z with:
2025-12-04T08:53:53.1133055Z no-sudo: true
2025-12-04T08:53:53.1133147Z submodules: recursive
2025-12-04T08:53:53.1133244Z fetch-depth: 0
2025-12-04T08:53:53.1133384Z env:
2025-12-04T08:53:53.1133474Z GIT_DEFAULT_BRANCH: main
2025-12-04T08:53:53.1133591Z ##[endgroup]
2025-12-04T08:53:53.1193540Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-12-04T08:53:53.1193917Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-12-04T08:53:53.1200301Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-12-04T08:53:53.1200456Z env:
2025-12-04T08:53:53.1200547Z GIT_DEFAULT_BRANCH: main
2025-12-04T08:53:53.1200653Z ##[endgroup]
2025-12-04T08:53:53.1540854Z ##[group]Run actions/checkout@v4
2025-12-04T08:53:53.1541327Z with:
2025-12-04T08:53:53.1541684Z ref: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32
2025-12-04T08:53:53.1542101Z fetch-depth: 0
2025-12-04T08:53:53.1542397Z submodules: recursive
2025-12-04T08:53:53.1542946Z show-progress: false
2025-12-04T08:53:53.1543333Z repository: pytorch/pytorch
2025-12-04T08:53:53.1543883Z token: ***
2025-12-04T08:53:53.1544161Z ssh-strict: true
2025-12-04T08:53:53.1544451Z ssh-user: git
2025-12-04T08:53:53.1544750Z persist-credentials: true
2025-12-04T08:53:53.1545105Z clean: true
2025-12-04T08:53:53.1545416Z sparse-checkout-cone-mode: true
2025-12-04T08:53:53.1545777Z fetch-tags: false
2025-12-04T08:53:53.1546150Z lfs: false
2025-12-04T08:53:53.1546430Z set-safe-directory: true
2025-12-04T08:53:53.1546752Z env:
2025-12-04T08:53:53.1547029Z GIT_DEFAULT_BRANCH: main
2025-12-04T08:53:53.1547342Z ##[endgroup]
2025-12-04T08:53:53.2250940Z Syncing repository: pytorch/pytorch
2025-12-04T08:53:53.2252936Z ##[group]Getting Git version info
2025-12-04T08:53:53.2253468Z Working directory is '/home/runner/_work/pytorch/pytorch'
2025-12-04T08:53:53.2254206Z [command]/usr/bin/git version
2025-12-04T08:53:53.2254693Z git version 2.52.0
2025-12-04T08:53:53.2255869Z ##[endgroup]
2025-12-04T08:53:53.2261191Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/4bd8c2ce-58a5-4a29-bd00-102b7b4bb5df/.gitconfig'
2025-12-04T08:53:53.2262353Z Temporarily overriding HOME='/home/runner/_work/_temp/4bd8c2ce-58a5-4a29-bd00-102b7b4bb5df' before making global git config changes
2025-12-04T08:53:53.2263742Z Adding repository directory to the temporary git global config as a safe directory
2025-12-04T08:53:53.2264576Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch
2025-12-04T08:53:53.2265720Z [command]/usr/bin/git config --local --get remote.origin.url
2025-12-04T08:53:53.2266349Z https://github.com/pytorch/pytorch
2025-12-04T08:53:53.2267476Z ##[group]Removing previously created refs, to avoid conflicts
2025-12-04T08:53:53.2268121Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-12-04T08:53:53.2268656Z refs/heads/main
2025-12-04T08:53:53.2269392Z [command]/usr/bin/git checkout --detach
2025-12-04T08:53:55.2842094Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919)
2025-12-04T08:53:55.2918353Z [command]/usr/bin/git branch --delete --force main
2025-12-04T08:53:55.3079016Z Deleted branch main (was ffd9b0fb4355).
2025-12-04T08:53:55.3090496Z ##[endgroup]
2025-12-04T08:53:55.3097451Z [command]/usr/bin/git submodule status
2025-12-04T08:53:55.3428897Z 7e1e1fe3858c63c251c637ae41a20de425dde96f android/libs/fbjni (v0.1.0-12-g7e1e1fe)
2025-12-04T08:53:55.3527750Z 4dfe081cf6bcd15db339cf2680b9281b8451eeb3 third_party/FP16 (4dfe081)
2025-12-04T08:53:55.3587670Z b408327ac2a15ec3e43352421954f5b1967701d1 third_party/FXdiv (b408327)
2025-12-04T08:53:55.3651800Z c07e3a0400713d546e0dea2d5466dd22ea389c73 third_party/NNPACK (c07e3a0)
2025-12-04T08:53:55.3691547Z 3ebbc93ded7285963bff932c678fa367eb393ba6 third_party/NVTX (v3.1.0-313-g3ebbc93)
2025-12-04T08:53:55.3741146Z 1d8f600fd424278486eade7ed3e877c99f0846b1 third_party/VulkanMemoryAllocator (v2.1.0-982-g1d8f600)
2025-12-04T08:53:55.4071505Z 51a0103656eff6fc9bfd39a4597923c4b542c883 third_party/XNNPACK (remotes/origin/ds/ndk-1243-g51a0103656)
2025-12-04T08:53:55.4119097Z 01aae101b9e5e94d6c16a9514c9fb8df99c93150 third_party/aiter (v0.1.1-92-g01aae101)
2025-12-04T08:53:55.4158351Z 299e5928955cc62af9968370293b916f5130916f third_party/benchmark (v1.9.3)
2025-12-04T08:53:55.4233085Z 7fe50dc3da2069d6645d9deb8c017a876472a977 third_party/composable_kernel (rocm-6.4.3-459-g7fe50dc3d)
2025-12-04T08:53:55.4341980Z 89c932f313c6437c38f2982869beacc89c2f2246 third_party/cpp-httplib (v0.26.0)
2025-12-04T08:53:55.4456759Z f858c30bcb16f8effd5ff46996f0514539e17abc third_party/cpuinfo (f858c30)
2025-12-04T08:53:55.4489353Z 0b1577c8c83401237d601d0d0db5210506705396 third_party/cudnn_frontend (v0.5-61-g0b1577c)
2025-12-04T08:53:55.4588986Z f88806b1e31dfa579842638740216dd41fc6c588 third_party/cutlass (v4.3.1)
2025-12-04T08:53:55.4634247Z c0b988d39a9e47c794d699f29930ed4d7c7e13a4 third_party/fbgemm (v1.4.0-rc1-2-gc0b988d39)
2025-12-04T08:53:55.4707965Z 979702c87a8713a8e0a5e9fee122b90d2ef13be5 third_party/flash-attention (v2.7.4)
2025-12-04T08:53:55.4736702Z a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757 third_party/flatbuffers (v24.12.23)
2025-12-04T08:53:55.5019623Z 407c905e45ad75fc29bf0f9bb7c5c2fd3475976f third_party/fmt (12.1.0)
2025-12-04T08:53:55.5146977Z 3fb5c176c17c765a3492cd2f0321b0dab712f350 third_party/gemmlowp/gemmlowp (remotes/origin/revert-87-master-135-g3fb5c17)
2025-12-04T08:53:55.5260842Z 54cbae0d3a67fa890b4c3d9ee162b7860315e341 third_party/gloo (remotes/origin/gh/c-p-i-o/1/base-37-g54cbae0)
2025-12-04T08:53:55.5392114Z 52eb8108c5bdec04579160ae17225d66034bd723 third_party/googletest (release-1.8.0-3544-g52eb8108)
2025-12-04T08:53:55.5449301Z 719d8e6cd7f7a0e01b155657526d693acf97c2b3 third_party/ideep (pytorch-rls-v3.7.1)
2025-12-04T08:53:55.5528435Z dec1d23ca65ab069d225dfe40dea14f455170959 third_party/ittapi (v3.25.5)
2025-12-04T08:53:55.5689534Z 31f85df8fbd89c188f14ef10f1ec65379786b943 third_party/kineto (heads/main)
2025-12-04T08:53:55.5706484Z d7770c89632329a9914ef1a90289917597639cbe third_party/kleidiai (v1.15.0)
2025-12-04T08:53:55.5741978Z fbd8b99c2b828428947d70fdc046bb55609be93e third_party/mimalloc (v2.2.4)
2025-12-04T08:53:55.5763563Z 55f93686c01528224f448c19128836e7df245f72 third_party/nlohmann (v3.12.0)
2025-12-04T08:53:55.5998216Z e709452ef2bbc1d113faf678c24e6d3467696e83 third_party/onnx (v1.18.0)
2025-12-04T08:53:55.6039240Z a799f4aed9c94b765dcdaabaeab7d5e7e2310878 third_party/opentelemetry-cpp (v1.14.2)
2025-12-04T08:53:55.6060710Z 0fa0ef591e38c2758e3184c6c23e497b9f732ffa third_party/pocketfft (release_for_eigen-40-g0fa0ef5)
2025-12-04T08:53:55.6284739Z d1eca4e4b421cd2997495c4b4e65cea6be4e9b8a third_party/protobuf (v3.7.0-rc.2-1279-gd1eca4e4b)
2025-12-04T08:53:55.6372141Z 072586a71b55b7f8c584153d223e95687148a900 third_party/psimd (heads/master)
2025-12-04T08:53:55.6455330Z 4fe0e1e183925bf8cfa6aae24237e724a96479b8 third_party/pthreadpool (0.1-144-g4fe0e1e)
2025-12-04T08:53:55.6478889Z f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8 third_party/pybind11 (v3.0.1)
2025-12-04T08:53:55.6544007Z f45429b087dd7d5bc78bb40dc7cf06425c252d67 third_party/python-peachpy (remotes/origin/pre-generated)
2025-12-04T08:53:55.6611826Z 5a1d179df9cf652951b59010a2d2075372d67f68 third_party/sleef (3.8)
2025-12-04T08:53:55.6670452Z 2b4cd91092d335a697416b2a3cb398283246849d third_party/tensorpipe (heads/main)
2025-12-04T08:53:55.6688375Z ##[group]Cleaning the repository
2025-12-04T08:53:55.6696577Z [command]/usr/bin/git clean -ffdx
2025-12-04T08:53:55.6847987Z [command]/usr/bin/git reset --hard HEAD
2025-12-04T08:53:56.4111075Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919)
2025-12-04T08:53:56.4226790Z ##[endgroup]
2025-12-04T08:53:56.4230198Z ##[group]Disabling automatic garbage collection
2025-12-04T08:53:56.4237920Z [command]/usr/bin/git config --local gc.auto 0
2025-12-04T08:53:56.4285905Z ##[endgroup]
2025-12-04T08:53:56.4286471Z ##[group]Setting up auth
2025-12-04T08:53:56.4297020Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-12-04T08:53:56.4345528Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-12-04T08:53:56.4659727Z Entering 'android/libs/fbjni'
2025-12-04T08:53:56.4696145Z Entering 'third_party/FP16'
2025-12-04T08:53:56.4735326Z Entering 'third_party/FXdiv'
2025-12-04T08:53:56.4778399Z Entering 'third_party/NNPACK'
2025-12-04T08:53:56.4802975Z Entering 'third_party/NVTX'
2025-12-04T08:53:56.4832945Z Entering 'third_party/VulkanMemoryAllocator'
2025-12-04T08:53:56.4865805Z Entering 'third_party/XNNPACK'
2025-12-04T08:53:56.4896707Z Entering 'third_party/aiter'
2025-12-04T08:53:56.4925093Z Entering 'third_party/aiter/3rdparty/composable_kernel'
2025-12-04T08:53:56.4962458Z Entering 'third_party/benchmark'
2025-12-04T08:53:56.5004686Z Entering 'third_party/composable_kernel'
2025-12-04T08:53:56.5044677Z Entering 'third_party/cpp-httplib'
2025-12-04T08:53:56.5111290Z Entering 'third_party/cpuinfo'
2025-12-04T08:53:56.5143200Z Entering 'third_party/cudnn_frontend'
2025-12-04T08:53:56.5185265Z Entering 'third_party/cutlass'
2025-12-04T08:53:56.5215252Z Entering 'third_party/fbgemm'
2025-12-04T08:53:56.5248434Z Entering 'third_party/fbgemm/external/asmjit'
2025-12-04T08:53:56.5308163Z Entering 'third_party/fbgemm/external/composable_kernel'
2025-12-04T08:53:56.5341190Z Entering 'third_party/fbgemm/external/cpuinfo'
2025-12-04T08:53:56.5364184Z Entering 'third_party/fbgemm/external/cutlass'
2025-12-04T08:53:56.5402622Z Entering 'third_party/fbgemm/external/googletest'
2025-12-04T08:53:56.5437851Z Entering 'third_party/fbgemm/external/hipify_torch'
2025-12-04T08:53:56.5483355Z Entering 'third_party/fbgemm/external/json'
2025-12-04T08:53:56.5518771Z Entering 'third_party/flash-attention'
2025-12-04T08:53:56.5570889Z Entering 'third_party/flash-attention/csrc/composable_kernel'
2025-12-04T08:53:56.5598640Z Entering 'third_party/flash-attention/csrc/cutlass'
2025-12-04T08:53:56.5639589Z Entering 'third_party/flatbuffers'
2025-12-04T08:53:56.5669301Z Entering 'third_party/fmt'
2025-12-04T08:53:56.5729780Z Entering 'third_party/gemmlowp/gemmlowp'
2025-12-04T08:53:56.5778237Z Entering 'third_party/gloo'
2025-12-04T08:53:56.5830848Z Entering 'third_party/googletest'
2025-12-04T08:53:56.5874013Z Entering 'third_party/ideep'
2025-12-04T08:53:56.5905229Z Entering 'third_party/ideep/mkl-dnn'
2025-12-04T08:53:56.5939789Z Entering 'third_party/ittapi'
2025-12-04T08:53:56.5965488Z Entering 'third_party/kineto'
2025-12-04T08:53:56.5991639Z Entering 'third_party/kineto/libkineto/third_party/dynolog'
2025-12-04T08:53:56.6016166Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'
2025-12-04T08:53:56.6052172Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'
2025-12-04T08:53:56.6084593Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'
2025-12-04T08:53:56.6120980Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'
2025-12-04T08:53:56.6162460Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'
2025-12-04T08:53:56.6194762Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog'
2025-12-04T08:53:56.6217128Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'
2025-12-04T08:53:56.6242918Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json'
2025-12-04T08:53:56.6266917Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'
2025-12-04T08:53:56.6330395Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp'
2025-12-04T08:53:56.6390796Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb'
2025-12-04T08:53:56.6442492Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest'
2025-12-04T08:53:56.6468375Z Entering 'third_party/kineto/libkineto/third_party/fmt'
2025-12-04T08:53:56.6499010Z Entering 'third_party/kineto/libkineto/third_party/googletest'
2025-12-04T08:53:56.6522670Z Entering 'third_party/kleidiai'
2025-12-04T08:53:56.6549309Z Entering 'third_party/mimalloc'
2025-12-04T08:53:56.6633618Z Entering 'third_party/nlohmann'
2025-12-04T08:53:56.6671714Z Entering 'third_party/onnx'
2025-12-04T08:53:56.6707445Z Entering 'third_party/onnx/third_party/pybind11'
2025-12-04T08:53:56.6738696Z Entering 'third_party/opentelemetry-cpp'
2025-12-04T08:53:56.6764161Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark'
2025-12-04T08:53:56.6787712Z Entering 'third_party/opentelemetry-cpp/third_party/googletest'
2025-12-04T08:53:56.6809574Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl'
2025-12-04T08:53:56.6831141Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json'
2025-12-04T08:53:56.6852912Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto'
2025-12-04T08:53:56.6880460Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp'
2025-12-04T08:53:56.6902061Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp'
2025-12-04T08:53:56.6923812Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb'
2025-12-04T08:53:56.6948010Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest'
2025-12-04T08:53:56.6971954Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg'
2025-12-04T08:53:56.7002666Z Entering 'third_party/pocketfft'
2025-12-04T08:53:56.7026957Z Entering 'third_party/protobuf'
2025-12-04T08:53:56.7050938Z Entering 'third_party/protobuf/third_party/benchmark'
2025-12-04T08:53:56.7083125Z Entering 'third_party/protobuf/third_party/googletest'
2025-12-04T08:53:56.7112235Z Entering 'third_party/psimd'
2025-12-04T08:53:56.7135727Z Entering 'third_party/pthreadpool'
2025-12-04T08:53:56.7159535Z Entering 'third_party/pybind11'
2025-12-04T08:53:56.7184463Z Entering 'third_party/python-peachpy'
2025-12-04T08:53:56.7232309Z Entering 'third_party/sleef'
2025-12-04T08:53:56.7260839Z Entering 'third_party/tensorpipe'
2025-12-04T08:53:56.7298551Z Entering 'third_party/tensorpipe/third_party/googletest'
2025-12-04T08:53:56.7339420Z Entering 'third_party/tensorpipe/third_party/libnop'
2025-12-04T08:53:56.7383742Z Entering 'third_party/tensorpipe/third_party/libuv'
2025-12-04T08:53:56.7414847Z Entering 'third_party/tensorpipe/third_party/pybind11'
2025-12-04T08:53:56.7444312Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang'
2025-12-04T08:53:56.7489551Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-12-04T08:53:56.7529158Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-12-04T08:53:56.7721415Z Entering 'android/libs/fbjni'
2025-12-04T08:53:56.7753806Z Entering 'third_party/FP16'
2025-12-04T08:53:56.7798369Z Entering 'third_party/FXdiv'
2025-12-04T08:53:56.7839323Z Entering 'third_party/NNPACK'
2025-12-04T08:53:56.7863993Z Entering 'third_party/NVTX'
2025-12-04T08:53:56.7896162Z Entering 'third_party/VulkanMemoryAllocator'
2025-12-04T08:53:56.7925862Z Entering 'third_party/XNNPACK'
2025-12-04T08:53:56.7963398Z Entering 'third_party/aiter'
2025-12-04T08:53:56.7993024Z Entering 'third_party/aiter/3rdparty/composable_kernel'
2025-12-04T08:53:56.8023686Z Entering 'third_party/benchmark'
2025-12-04T08:53:56.8060005Z Entering 'third_party/composable_kernel'
2025-12-04T08:53:56.8102347Z Entering 'third_party/cpp-httplib'
2025-12-04T08:53:56.8140608Z Entering 'third_party/cpuinfo'
2025-12-04T08:53:56.8174472Z Entering 'third_party/cudnn_frontend'
2025-12-04T08:53:56.8203736Z Entering 'third_party/cutlass'
2025-12-04T08:53:56.8233186Z Entering 'third_party/fbgemm'
2025-12-04T08:53:56.8274031Z Entering 'third_party/fbgemm/external/asmjit'
2025-12-04T08:53:56.8317904Z Entering 'third_party/fbgemm/external/composable_kernel'
2025-12-04T08:53:56.8379192Z Entering 'third_party/fbgemm/external/cpuinfo'
2025-12-04T08:53:56.8411396Z Entering 'third_party/fbgemm/external/cutlass'
2025-12-04T08:53:56.8438005Z Entering 'third_party/fbgemm/external/googletest'
2025-12-04T08:53:56.8487861Z Entering 'third_party/fbgemm/external/hipify_torch'
2025-12-04T08:53:56.8527401Z Entering 'third_party/fbgemm/external/json'
2025-12-04T08:53:56.8587835Z Entering 'third_party/flash-attention'
2025-12-04T08:53:56.8650287Z Entering 'third_party/flash-attention/csrc/composable_kernel'
2025-12-04T08:53:56.8685274Z Entering 'third_party/flash-attention/csrc/cutlass'
2025-12-04T08:53:56.8740229Z Entering 'third_party/flatbuffers'
2025-12-04T08:53:56.8790936Z Entering 'third_party/fmt'
2025-12-04T08:53:56.8826184Z Entering 'third_party/gemmlowp/gemmlowp'
2025-12-04T08:53:56.8863218Z Entering 'third_party/gloo'
2025-12-04T08:53:56.8908277Z Entering 'third_party/googletest'
2025-12-04T08:53:56.8948179Z Entering 'third_party/ideep'
2025-12-04T08:53:56.8998984Z Entering 'third_party/ideep/mkl-dnn'
2025-12-04T08:53:56.9030895Z Entering 'third_party/ittapi'
2025-12-04T08:53:56.9089750Z Entering 'third_party/kineto'
2025-12-04T08:53:56.9121240Z Entering 'third_party/kineto/libkineto/third_party/dynolog'
2025-12-04T08:53:56.9152195Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'
2025-12-04T08:53:56.9191170Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'
2025-12-04T08:53:56.9225647Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'
2025-12-04T08:53:56.9258820Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'
2025-12-04T08:53:56.9306367Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'
2025-12-04T08:53:56.9351827Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog'
2025-12-04T08:53:56.9398750Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'
2025-12-04T08:53:56.9423823Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json'
2025-12-04T08:53:56.9460077Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'
2025-12-04T08:53:56.9512108Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp'
2025-12-04T08:53:56.9546192Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb'
2025-12-04T08:53:56.9587615Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest'
2025-12-04T08:53:56.9617716Z Entering 'third_party/kineto/libkineto/third_party/fmt'
2025-12-04T08:53:56.9644331Z Entering 'third_party/kineto/libkineto/third_party/googletest'
2025-12-04T08:53:56.9674011Z Entering 'third_party/kleidiai'
2025-12-04T08:53:56.9707967Z Entering 'third_party/mimalloc'
2025-12-04T08:53:56.9740723Z Entering 'third_party/nlohmann'
2025-12-04T08:53:56.9776824Z Entering 'third_party/onnx'
2025-12-04T08:53:56.9809081Z Entering 'third_party/onnx/third_party/pybind11'
2025-12-04T08:53:56.9848508Z Entering 'third_party/opentelemetry-cpp'
2025-12-04T08:53:56.9883752Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark'
2025-12-04T08:53:56.9916055Z Entering 'third_party/opentelemetry-cpp/third_party/googletest'
2025-12-04T08:53:56.9954266Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl'
2025-12-04T08:53:56.9979159Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json'
2025-12-04T08:53:57.0015254Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto'
2025-12-04T08:53:57.0049515Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp'
2025-12-04T08:53:57.0085355Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp'
2025-12-04T08:53:57.0133434Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb'
2025-12-04T08:53:57.0165721Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest'
2025-12-04T08:53:57.0196978Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg'
2025-12-04T08:53:57.0231506Z Entering 'third_party/pocketfft'
2025-12-04T08:53:57.0268949Z Entering 'third_party/protobuf'
2025-12-04T08:53:57.0298202Z Entering 'third_party/protobuf/third_party/benchmark'
2025-12-04T08:53:57.0348544Z Entering 'third_party/protobuf/third_party/googletest'
2025-12-04T08:53:57.0401208Z Entering 'third_party/psimd'
2025-12-04T08:53:57.0451681Z Entering 'third_party/pthreadpool'
2025-12-04T08:53:57.0492959Z Entering 'third_party/pybind11'
2025-12-04T08:53:57.0523627Z Entering 'third_party/python-peachpy'
2025-12-04T08:53:57.0566185Z Entering 'third_party/sleef'
2025-12-04T08:53:57.0614612Z Entering 'third_party/tensorpipe'
2025-12-04T08:53:57.0666280Z Entering 'third_party/tensorpipe/third_party/googletest'
2025-12-04T08:53:57.0698464Z Entering 'third_party/tensorpipe/third_party/libnop'
2025-12-04T08:53:57.0737964Z Entering 'third_party/tensorpipe/third_party/libuv'
2025-12-04T08:53:57.0782732Z Entering 'third_party/tensorpipe/third_party/pybind11'
2025-12-04T08:53:57.0835438Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang'
2025-12-04T08:53:57.0899589Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2025-12-04T08:53:57.0923504Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2025-12-04T08:53:57.1137099Z Entering 'android/libs/fbjni'
2025-12-04T08:53:57.1156315Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url
2025-12-04T08:53:57.1167001Z Entering 'third_party/FP16'
2025-12-04T08:53:57.1187943Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url
2025-12-04T08:53:57.1197107Z Entering 'third_party/FXdiv'
2025-12-04T08:53:57.1215181Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url
2025-12-04T08:53:57.1224628Z Entering 'third_party/NNPACK'
2025-12-04T08:53:57.1238385Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url
2025-12-04T08:53:57.1261494Z Entering 'third_party/NVTX'
2025-12-04T08:53:57.1289837Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url
2025-12-04T08:53:57.1308961Z Entering 'third_party/VulkanMemoryAllocator'
2025-12-04T08:53:57.1322507Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url
2025-12-04T08:53:57.1331324Z Entering 'third_party/XNNPACK'
2025-12-04T08:53:57.1342261Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url
2025-12-04T08:53:57.1360317Z Entering 'third_party/aiter'
2025-12-04T08:53:57.1370969Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url
2025-12-04T08:53:57.1380678Z Entering 'third_party/aiter/3rdparty/composable_kernel'
2025-12-04T08:53:57.1396394Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url
2025-12-04T08:53:57.1408574Z Entering 'third_party/benchmark'
2025-12-04T08:53:57.1419307Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url
2025-12-04T08:53:57.1439947Z Entering 'third_party/composable_kernel'
2025-12-04T08:53:57.1450392Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url
2025-12-04T08:53:57.1463349Z Entering 'third_party/cpp-httplib'
2025-12-04T08:53:57.1473646Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url
2025-12-04T08:53:57.1482376Z Entering 'third_party/cpuinfo'
2025-12-04T08:53:57.1492103Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url
2025-12-04T08:53:57.1500923Z Entering 'third_party/cudnn_frontend'
2025-12-04T08:53:57.1510611Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url
2025-12-04T08:53:57.1519684Z Entering 'third_party/cutlass'
2025-12-04T08:53:57.1536628Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url
2025-12-04T08:53:57.1549144Z Entering 'third_party/fbgemm'
2025-12-04T08:53:57.1563954Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url
2025-12-04T08:53:57.1575009Z Entering 'third_party/fbgemm/external/asmjit'
2025-12-04T08:53:57.1599852Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url
2025-12-04T08:53:57.1618730Z Entering 'third_party/fbgemm/external/composable_kernel'
2025-12-04T08:53:57.1645001Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url
2025-12-04T08:53:57.1657936Z Entering 'third_party/fbgemm/external/cpuinfo'
2025-12-04T08:53:57.1680307Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url
2025-12-04T08:53:57.1689824Z Entering 'third_party/fbgemm/external/cutlass'
2025-12-04T08:53:57.1709875Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url
2025-12-04T08:53:57.1729473Z Entering 'third_party/fbgemm/external/googletest'
2025-12-04T08:53:57.1739790Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url
2025-12-04T08:53:57.1747191Z Entering 'third_party/fbgemm/external/hipify_torch'
2025-12-04T08:53:57.1756221Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url
2025-12-04T08:53:57.1763974Z Entering 'third_party/fbgemm/external/json'
2025-12-04T08:53:57.1773035Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url
2025-12-04T08:53:57.1783484Z Entering 'third_party/flash-attention'
2025-12-04T08:53:57.1812032Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url
2025-12-04T08:53:57.1823347Z Entering 'third_party/flash-attention/csrc/composable_kernel'
2025-12-04T08:53:57.1846633Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url
2025-12-04T08:53:57.1867317Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:53:57.1884121Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T08:53:57.1897590Z Entering 'third_party/flatbuffers' 2025-12-04T08:53:57.1908465Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T08:53:57.1920051Z Entering 'third_party/fmt' 2025-12-04T08:53:57.1936016Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:53:57.1946775Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:53:57.1965139Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T08:53:57.1987550Z Entering 'third_party/gloo' 2025-12-04T08:53:57.1999028Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T08:53:57.2008337Z Entering 'third_party/googletest' 2025-12-04T08:53:57.2029911Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:53:57.2042137Z Entering 'third_party/ideep' 2025-12-04T08:53:57.2061700Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T08:53:57.2070847Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:53:57.2091533Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T08:53:57.2105802Z Entering 'third_party/ittapi' 2025-12-04T08:53:57.2133866Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T08:53:57.2145086Z Entering 'third_party/kineto' 2025-12-04T08:53:57.2167734Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T08:53:57.2185787Z Entering 
'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:53:57.2198508Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T08:53:57.2207543Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:53:57.2233029Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T08:53:57.2245152Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:53:57.2255736Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T08:53:57.2265517Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:53:57.2279239Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:53:57.2288115Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:53:57.2304897Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T08:53:57.2314165Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:53:57.2326526Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T08:53:57.2348860Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:53:57.2377902Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config 
remote.origin.url 2025-12-04T08:53:57.2397213Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:53:57.2416741Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:53:57.2442401Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:53:57.2470180Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T08:53:57.2494894Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:53:57.2511194Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T08:53:57.2520620Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:53:57.2545336Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:53:57.2553381Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:53:57.2581856Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:53:57.2600961Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:53:57.2622764Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:53:57.2638544Z Entering 
'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:53:57.2663996Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T08:53:57.2684277Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:53:57.2711992Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T08:53:57.2734265Z Entering 'third_party/kleidiai' 2025-12-04T08:53:57.2751901Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T08:53:57.2764513Z Entering 'third_party/mimalloc' 2025-12-04T08:53:57.2782073Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T08:53:57.2793253Z Entering 'third_party/nlohmann' 2025-12-04T08:53:57.2818663Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T08:53:57.2830210Z Entering 'third_party/onnx' 2025-12-04T08:53:57.2847427Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T08:53:57.2879722Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:53:57.2890708Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:53:57.2914776Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:53:57.2931992Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T08:53:57.2948525Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:53:57.2968318Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:53:57.2986031Z Entering 
'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:53:57.3002257Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:53:57.3012221Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:53:57.3032003Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T08:53:57.3043085Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:53:57.3054333Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T08:53:57.3064789Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:53:57.3090986Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T08:53:57.3100959Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:53:57.3111852Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T08:53:57.3131895Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:53:57.3153686Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:53:57.3163514Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:53:57.3179175Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:53:57.3202338Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 
2025-12-04T08:53:57.3219109Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:53:57.3240722Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:53:57.3264973Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T08:53:57.3285813Z Entering 'third_party/pocketfft' 2025-12-04T08:53:57.3297802Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T08:53:57.3315849Z Entering 'third_party/protobuf' 2025-12-04T08:53:57.3335980Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T08:53:57.3349698Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:53:57.3369857Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:53:57.3390680Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:53:57.3401740Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:53:57.3425189Z Entering 'third_party/psimd' 2025-12-04T08:53:57.3445487Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T08:53:57.3457049Z Entering 'third_party/pthreadpool' 2025-12-04T08:53:57.3468971Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T08:53:57.3478557Z Entering 'third_party/pybind11' 2025-12-04T08:53:57.3496995Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:53:57.3519413Z Entering 'third_party/python-peachpy' 2025-12-04T08:53:57.3529819Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T08:53:57.3549553Z Entering 'third_party/sleef' 2025-12-04T08:53:57.3571403Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T08:53:57.3593381Z Entering 'third_party/tensorpipe' 2025-12-04T08:53:57.3612820Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T08:53:57.3636175Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:53:57.3658568Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:53:57.3675642Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:53:57.3701344Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T08:53:57.3717891Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:53:57.3737269Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T08:53:57.3759424Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:53:57.3776132Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:53:57.3786019Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:53:57.3807119Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T08:53:57.3847226Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.3877811Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.3908852Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.3931631Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.3969082Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.3992755Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4015188Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4040555Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4064267Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4086913Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4110625Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4133395Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4165643Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4241661Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4244379Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4259131Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4283178Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4321012Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4344943Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4378926Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4415310Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4450210Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4488634Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4522139Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4559447Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4583738Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4608920Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4646861Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4685365Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4728286Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4762127Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4796185Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4830299Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4867976Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4900216Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4943990Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.4980119Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5013050Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5045197Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5077601Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5103854Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5129140Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5161293Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5192050Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5224217Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5247862Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5267680Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5300693Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T08:53:57.5329409Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5360903Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5390218Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5420400Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5440686Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5468436Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5494883Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5523113Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5558222Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5592181Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5623600Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5653589Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5680408Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5715762Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5744245Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5776459Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5808134Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5827762Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5862610Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5884370Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5920145Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5941203Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.5973717Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6006761Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6038643Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6058730Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6077146Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6098701Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6128964Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6158802Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6189196Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6208029Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6227483Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:53:57.6258127Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T08:53:57.6291211Z ##[endgroup] 2025-12-04T08:53:57.6291748Z ##[group]Fetching the repository 2025-12-04T08:53:57.6298166Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2025-12-04T08:53:58.3841030Z From https://github.com/pytorch/pytorch 2025-12-04T08:53:58.3841303Z - [deleted] (none) -> ciflow/inductor/160174 2025-12-04T08:53:58.3841487Z - [deleted] (none) -> ciflow/trunk/160174 2025-12-04T08:54:02.2070962Z * [new branch] 2.6.0.dev20241004+ -> origin/2.6.0.dev20241004+ 2025-12-04T08:54:02.2071635Z * [new branch] 2.9.1 -> origin/2.9.1 2025-12-04T08:54:02.2072367Z * [new branch] AaronWang04_addmmfusion_perftest -> origin/AaronWang04_addmmfusion_perftest 2025-12-04T08:54:02.2073119Z * [new branch] 
Flamefire-patch-1 -> origin/Flamefire-patch-1 2025-12-04T08:54:02.2073901Z * [new branch] HDCharles-2.6.0-release-notes -> origin/HDCharles-2.6.0-release-notes 2025-12-04T08:54:02.2074578Z * [new branch] HOPrintFunc -> origin/HOPrintFunc 2025-12-04T08:54:02.2075164Z * [new branch] IvanKobzarev/stack/1 -> origin/IvanKobzarev/stack/1 2025-12-04T08:54:02.2075757Z * [new branch] NicoshevSVE128 -> origin/NicoshevSVE128 2025-12-04T08:54:02.2076459Z * [new branch] PR-AOTInductorNoneBug -> origin/PR-AOTInductorNoneBug 2025-12-04T08:54:02.2077123Z * [new branch] PR-AOTInductorNoneBugFix -> origin/PR-AOTInductorNoneBugFix 2025-12-04T08:54:02.2077768Z * [new branch] PR-FixConfigsIssue -> origin/PR-FixConfigsIssue 2025-12-04T08:54:02.2079170Z * [new branch] PR-NoneBugFix-viable -> origin/PR-NoneBugFix-viable 2025-12-04T08:54:02.2079763Z * [new branch] PR-ResetToZero -> origin/PR-ResetToZero 2025-12-04T08:54:02.2080380Z * [new branch] Update-Flash-Packaging -> origin/Update-Flash-Packaging 2025-12-04T08:54:02.2080981Z * [new branch] VLA_exp -> origin/VLA_exp 2025-12-04T08:54:02.2081522Z * [new branch] activation_bench -> origin/activation_bench 2025-12-04T08:54:02.2082092Z * [new branch] addmm-heuristic -> origin/addmm-heuristic 2025-12-04T08:54:02.2082665Z * [new branch] adi/onednn_aarch64 -> origin/adi/onednn_aarch64 2025-12-04T08:54:02.2083214Z * [new branch] adi/test -> origin/adi/test 2025-12-04T08:54:02.2083744Z * [new branch] adi/test_bgemm -> origin/adi/test_bgemm 2025-12-04T08:54:02.2084285Z * [new branch] adi/test_m8g -> origin/adi/test_m8g 2025-12-04T08:54:02.2084835Z * [new branch] adi/test_onednn -> origin/adi/test_onednn 2025-12-04T08:54:02.2085396Z * [new branch] adi/test_onednn_v3.9 -> origin/adi/test_onednn_v3.9 2025-12-04T08:54:02.2086085Z * [new branch] adi/test_presve_change -> origin/adi/test_presve_change 2025-12-04T08:54:02.2086857Z * [new branch] adi/test_timm -> origin/adi/test_timm 2025-12-04T08:54:02.2087447Z * [new branch] adi/testpresve_change -> 
origin/adi/testpresve_change 2025-12-04T08:54:02.2088070Z * [new branch] aditew01/test/vec_bf16 -> origin/aditew01/test/vec_bf16 2025-12-04T08:54:02.2088705Z * [new branch] ah-globalfeedback-hook -> origin/ah-globalfeedback-hook 2025-12-04T08:54:02.2089352Z * [new branch] albanD-patch-1 -> origin/albanD-patch-1 2025-12-04T08:54:02.2089961Z * [new branch] also-surround-shimh -> origin/also-surround-shimh 2025-12-04T08:54:02.2090576Z * [new branch] angelayi/aot_compile -> origin/angelayi/aot_compile 2025-12-04T08:54:02.2091272Z * [new branch] angelayi/aoti_additional_files -> origin/angelayi/aoti_additional_files 2025-12-04T08:54:02.2091955Z * [new branch] angelayi/benchmark -> origin/angelayi/benchmark 2025-12-04T08:54:02.2092675Z * [new branch] angelayi/change_pytree_serialization -> origin/angelayi/change_pytree_serialization 2025-12-04T08:54:02.2093414Z * [new branch] angelayi/cpp_loader -> origin/angelayi/cpp_loader 2025-12-04T08:54:02.2094033Z * [new branch] angelayi/inductor_const -> origin/angelayi/inductor_const 2025-12-04T08:54:02.2094622Z * [new branch] angelayi/lstm -> origin/angelayi/lstm 2025-12-04T08:54:02.2095196Z * [new branch] angelayi/no_so_weight -> origin/angelayi/no_so_weight 2025-12-04T08:54:02.2095816Z * [new branch] angelayi/scan_layers -> origin/angelayi/scan_layers 2025-12-04T08:54:02.2096462Z * [new branch] angelayi/side_eff -> origin/angelayi/side_eff 2025-12-04T08:54:02.2097041Z * [new branch] angelayi/state_dict -> origin/angelayi/state_dict 2025-12-04T08:54:02.2097648Z * [new branch] angelayi/symint_input -> origin/angelayi/symint_input 2025-12-04T08:54:02.2098238Z * [new branch] angelayi/symm_mem -> origin/angelayi/symm_mem 2025-12-04T08:54:02.2098810Z * [new branch] angelayi/test_cpp -> origin/angelayi/test_cpp 2025-12-04T08:54:02.2099388Z * [new branch] angelayi/torch_size -> origin/angelayi/torch_size 2025-12-04T08:54:02.2099958Z * [new branch] annotate_assert -> origin/annotate_assert 2025-12-04T08:54:02.2100562Z * [new branch] 
annotate_fallback_kernel -> origin/annotate_fallback_kernel 2025-12-04T08:54:02.2101515Z * [new branch] annotation_deepcopy -> origin/annotation_deepcopy 2025-12-04T08:54:02.2102091Z * [new branch] annotation_dynamo -> origin/annotation_dynamo 2025-12-04T08:54:02.2102666Z * [new branch] aot_eager_stack_trace -> origin/aot_eager_stack_trace 2025-12-04T08:54:02.2103252Z * [new branch] aoti-cuda-alloc -> origin/aoti-cuda-alloc 2025-12-04T08:54:02.2103812Z * [new branch] aoti_const_device -> origin/aoti_const_device 2025-12-04T08:54:02.2104399Z * [new branch] aoti_fqn_name_interface -> origin/aoti_fqn_name_interface 2025-12-04T08:54:02.2105059Z * [new branch] aoti_package_weights_binary -> origin/aoti_package_weights_binary 2025-12-04T08:54:02.2105708Z * [new branch] aoti_target_windows -> origin/aoti_target_windows 2025-12-04T08:54:02.2106484Z * [new branch] arsh/feat/inductor_check_profiling -> origin/arsh/feat/inductor_check_profiling 2025-12-04T08:54:02.2107174Z * [new branch] async_tp -> origin/async_tp 2025-12-04T08:54:02.2107817Z * [new branch] atalman-inductor-perf-cu124 -> origin/atalman-inductor-perf-cu124 2025-12-04T08:54:02.2108586Z * [new branch] atalman-inductor-perf-cu124.1 -> origin/atalman-inductor-perf-cu124.1 2025-12-04T08:54:02.2109370Z * [new branch] atalman-patch-2 -> origin/atalman-patch-2 2025-12-04T08:54:02.2109943Z * [new branch] atalman-patch-3 -> origin/atalman-patch-3 2025-12-04T08:54:02.2110499Z * [new branch] atalman-patch-4 -> origin/atalman-patch-4 2025-12-04T08:54:02.2111054Z * [new branch] atalman-patch-5 -> origin/atalman-patch-5 2025-12-04T08:54:02.2111613Z * [new branch] atalman-patch-6 -> origin/atalman-patch-6 2025-12-04T08:54:02.2112221Z * [new branch] atalman-patch-7 -> origin/atalman-patch-7 2025-12-04T08:54:02.2112772Z * [new branch] atalman-patch-8 -> origin/atalman-patch-8 2025-12-04T08:54:02.2113361Z * [new branch] atalman_inductor_2.3.1 -> origin/atalman_inductor_2.3.1 2025-12-04T08:54:02.2113989Z * [new branch] 
atalman_inductor_2.4.0 -> origin/atalman_inductor_2.4.0 2025-12-04T08:54:02.2114612Z * [new branch] atalman_inductor_2.4.x -> origin/atalman_inductor_2.4.x 2025-12-04T08:54:02.2115447Z * [new branch] attention_benchmarking_clean -> origin/attention_benchmarking_clean 2025-12-04T08:54:02.2116204Z * [new branch] bahuang/dt_fix_scalar_add -> origin/bahuang/dt_fix_scalar_add 2025-12-04T08:54:02.2116842Z * [new branch] bahuang/fix_debug_mode -> origin/bahuang/fix_debug_mode 2025-12-04T08:54:02.2117453Z * [new branch] bahuang/fix_expand -> origin/bahuang/fix_expand 2025-12-04T08:54:02.2118028Z * [new branch] bahuang/test -> origin/bahuang/test 2025-12-04T08:54:02.2118557Z * [new branch] base/1.5 -> origin/base/1.5 2025-12-04T08:54:02.2119226Z * [new branch] batching_sdpa_efficient_attention -> origin/batching_sdpa_efficient_attention 2025-12-04T08:54:02.2119934Z * [new branch] bench_scaled_mm_ops -> origin/bench_scaled_mm_ops 2025-12-04T08:54:02.2120533Z * [new branch] benchmark-updates -> origin/benchmark-updates 2025-12-04T08:54:02.2121153Z * [new branch] benchmarking-script -> origin/benchmarking-script 2025-12-04T08:54:02.2121774Z * [new branch] bertmaher/pinbump26 -> origin/bertmaher/pinbump26 2025-12-04T08:54:02.2122354Z * [new branch] bertrand/cutlass -> origin/bertrand/cutlass 2025-12-04T08:54:02.2122939Z * [new branch] bf/bug-static-input -> origin/bf/bug-static-input 2025-12-04T08:54:02.2123637Z * [new branch] bf/cg-backend -> origin/bf/cg-backend 2025-12-04T08:54:02.2124188Z * [new branch] bf/cg-nccl-test -> origin/bf/cg-nccl-test 2025-12-04T08:54:02.2124761Z * [new branch] bf/cg-remove-check -> origin/bf/cg-remove-check 2025-12-04T08:54:02.2125374Z * [new branch] bf/clean-torchbench-hf -> origin/bf/clean-torchbench-hf 2025-12-04T08:54:02.2126061Z * [new branch] bf/combo-debug-log -> origin/bf/combo-debug-log 2025-12-04T08:54:02.2126639Z * [new branch] bf/cudagraph -> origin/bf/cudagraph 2025-12-04T08:54:02.2127371Z * [new branch] 
bf/cudagraph-disable-input-mutation -> origin/bf/cudagraph-disable-input-mutation 2025-12-04T08:54:02.2128512Z * [new branch] bf/cudagraph-enable-input-mutation-support-benchmark -> origin/bf/cudagraph-enable-input-mutation-support-benchmark 2025-12-04T08:54:02.2129522Z * [new branch] bf/cudagraph-partition -> origin/bf/cudagraph-partition 2025-12-04T08:54:02.2130173Z * [new branch] bf/donated-buffer-bench -> origin/bf/donated-buffer-bench 2025-12-04T08:54:02.2130803Z * [new branch] bf/dynamo-partition -> origin/bf/dynamo-partition 2025-12-04T08:54:02.2131372Z * [new branch] bf/lite -> origin/bf/lite 2025-12-04T08:54:02.2132070Z * [new branch] bf/pa-non-divisible -> origin/bf/pa-non-divisible 2025-12-04T08:54:02.2132791Z * [new branch] bf/partition-cache-free-symbols -> origin/bf/partition-cache-free-symbols 2025-12-04T08:54:02.2133553Z * [new branch] bf/partition-memory-plan -> origin/bf/partition-memory-plan 2025-12-04T08:54:02.2134212Z * [new branch] bf/partition-move-cpu -> origin/bf/partition-move-cpu 2025-12-04T08:54:02.2134895Z * [new branch] bf/partition-view-fallback -> origin/bf/partition-view-fallback 2025-12-04T08:54:02.2135599Z * [new branch] bf/remove-check-55b0c39d -> origin/bf/remove-check-55b0c39d 2025-12-04T08:54:02.2136282Z * [new branch] bf/timm-nov-26-2025 -> origin/bf/timm-nov-26-2025 2025-12-04T08:54:02.2136940Z * [new branch] bf/transformer-pin-4-57-3 -> origin/bf/transformer-pin-4-57-3 2025-12-04T08:54:02.2137666Z * [new branch] bisect_perf_hf_T5_3acc6eac492 -> origin/bisect_perf_hf_T5_3acc6eac492 2025-12-04T08:54:02.2138377Z * [new branch] bisect_perf_hf_T5_3fcf66f61fb -> origin/bisect_perf_hf_T5_3fcf66f61fb 2025-12-04T08:54:02.2139071Z * [new branch] bisect_perf_hf_T5_4009d154129 -> origin/bisect_perf_hf_T5_4009d154129 2025-12-04T08:54:02.2139765Z * [new branch] bisect_perf_hf_T5_40d0740e73d -> origin/bisect_perf_hf_T5_40d0740e73d 2025-12-04T08:54:02.2140433Z * [new branch] bisect_perf_hf_T5_5268754e -> origin/bisect_perf_hf_T5_5268754e 
2025-12-04T08:54:02.2141116Z * [new branch] bisect_perf_hf_T5_7d89a8d385c -> origin/bisect_perf_hf_T5_7d89a8d385c 2025-12-04T08:54:02.2141806Z * [new branch] bisect_perf_hf_T5_b7a25c1ee7c -> origin/bisect_perf_hf_T5_b7a25c1ee7c 2025-12-04T08:54:02.2142495Z * [new branch] bisect_perf_hf_T5_c25b201583f -> origin/bisect_perf_hf_T5_c25b201583f 2025-12-04T08:54:02.2143180Z * [new branch] bisect_perf_hf_T5_c93e57efac0 -> origin/bisect_perf_hf_T5_c93e57efac0 2025-12-04T08:54:02.2143861Z * [new branch] bisect_perf_hf_T5_ca9813ea149 -> origin/bisect_perf_hf_T5_ca9813ea149 2025-12-04T08:54:02.2144532Z * [new branch] bisect_perf_hf_T5_d65f194a -> origin/bisect_perf_hf_T5_d65f194a 2025-12-04T08:54:02.2145196Z * [new branch] bisect_perf_hf_T5_da94ab0b -> origin/bisect_perf_hf_T5_da94ab0b 2025-12-04T08:54:02.2145874Z * [new branch] bisect_perf_hf_T5_da94ab0b_new -> origin/bisect_perf_hf_T5_da94ab0b_new 2025-12-04T08:54:02.2146730Z * [new branch] bisect_perf_hf_T5_db4e8a1d8a8 -> origin/bisect_perf_hf_T5_db4e8a1d8a8 2025-12-04T08:54:02.2147417Z * [new branch] bisect_perf_hf_T5_e0d97e936a2 -> origin/bisect_perf_hf_T5_e0d97e936a2 2025-12-04T08:54:02.2148093Z * [new branch] bisect_perf_hf_T5_f23621ec563 -> origin/bisect_perf_hf_T5_f23621ec563 2025-12-04T08:54:02.2148754Z * [new branch] brister/fx_device_type -> origin/brister/fx_device_type 2025-12-04T08:54:02.2149448Z * [new branch] brister/test_inductor_all_fx -> origin/brister/test_inductor_all_fx 2025-12-04T08:54:02.2150247Z * [new branch] brister/tiled_reduction_no_numel_check -> origin/brister/tiled_reduction_no_numel_check 2025-12-04T08:54:02.2150979Z * [new branch] bwd-backup -> origin/bwd-backup 2025-12-04T08:54:02.2151521Z * [new branch] c57382a49 -> origin/c57382a49 2025-12-04T08:54:02.2152042Z * [new branch] ca_0431d47eaa -> origin/ca_0431d47eaa 2025-12-04T08:54:02.2152596Z * [new branch] ca_fix_0431d47eaa -> origin/ca_fix_0431d47eaa 2025-12-04T08:54:02.2153250Z * [new branch] camyllh/test_setup_hooks_push -> 
origin/camyllh/test_setup_hooks_push 2025-12-04T08:54:02.2153909Z * [new branch] cccclai-patch-1 -> origin/cccclai-patch-1 2025-12-04T08:54:02.2154792Z * [new branch] cherry-pick-159969-by-pytorch_bot_bot_ -> origin/cherry-pick-159969-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2155704Z * [new branch] cherry-pick-160586-by-pytorch_bot_bot_ -> origin/cherry-pick-160586-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2156648Z * [new branch] cherry-pick-162208-by-pytorch_bot_bot_ -> origin/cherry-pick-162208-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2157527Z * [new branch] cherry-pick-163169-by-pytorch_bot_bot_ -> origin/cherry-pick-163169-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2158407Z * [new branch] cherry-pick-165086-by-pytorch_bot_bot_ -> origin/cherry-pick-165086-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2159272Z * [new branch] cherry-pick-165514-by-pytorch_bot_bot_ -> origin/cherry-pick-165514-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2160153Z * [new branch] cherry-pick-165601-by-pytorch_bot_bot_ -> origin/cherry-pick-165601-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2161034Z * [new branch] cherry-pick-165667-by-pytorch_bot_bot_ -> origin/cherry-pick-165667-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2161900Z * [new branch] cherry-pick-165815-by-pytorch_bot_bot_ -> origin/cherry-pick-165815-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2162769Z * [new branch] cherry-pick-165922-by-pytorch_bot_bot_ -> origin/cherry-pick-165922-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2163638Z * [new branch] cherry-pick-166148-by-pytorch_bot_bot_ -> origin/cherry-pick-166148-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2164514Z * [new branch] cherry-pick-166181-by-pytorch_bot_bot_ -> origin/cherry-pick-166181-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2165383Z * [new branch] cherry-pick-166404-by-pytorch_bot_bot_ -> origin/cherry-pick-166404-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2166323Z * [new branch] cherry-pick-166427-by-pytorch_bot_bot_ -> origin/cherry-pick-166427-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2167201Z 
* [new branch] cherry-pick-166480-by-pytorch_bot_bot_ -> origin/cherry-pick-166480-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2168060Z * [new branch] cherry-pick-166570-by-pytorch_bot_bot_ -> origin/cherry-pick-166570-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2168931Z * [new branch] cherry-pick-166993-by-pytorch_bot_bot_ -> origin/cherry-pick-166993-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2169799Z * [new branch] cherry-pick-167111-by-pytorch_bot_bot_ -> origin/cherry-pick-167111-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2170765Z * [new branch] cherry-pick-167478-by-pytorch_bot_bot_ -> origin/cherry-pick-167478-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2171523Z * [new branch] cherry_pick_166036_166040 -> origin/cherry_pick_166036_166040 2025-12-04T08:54:02.2172135Z * [new branch] cherry_pick_166457 -> origin/cherry_pick_166457 2025-12-04T08:54:02.2172722Z * [new branch] cherrypick_166338 -> origin/cherrypick_166338 2025-12-04T08:54:02.2173296Z * [new branch] cherrypick_166458 -> origin/cherrypick_166458 2025-12-04T08:54:02.2173860Z * [new branch] cherrypick_166586 -> origin/cherrypick_166586 2025-12-04T08:54:02.2174413Z * [new branch] cherrypick_166956 -> origin/cherrypick_166956 2025-12-04T08:54:02.2174959Z * [new branch] ci_attn -> origin/ci_attn 2025-12-04T08:54:02.2175491Z * [new branch] codex-testing -> origin/codex-testing 2025-12-04T08:54:02.2176418Z * [new branch] codex/add-check_memory_overlap-helper-functions -> origin/codex/add-check_memory_overlap-helper-functions 2025-12-04T08:54:02.2177385Z * [new branch] codex/fix-issue-121219-in-pytorch -> origin/codex/fix-issue-121219-in-pytorch 2025-12-04T08:54:02.2178507Z * [new branch] codex/investigate-segfaults-in-get_tensor_storage_id -> origin/codex/investigate-segfaults-in-get_tensor_storage_id 2025-12-04T08:54:02.2179680Z * [new branch] codex/refactor-lintrunner-config-to-use-uv-run -> origin/codex/refactor-lintrunner-config-to-use-uv-run 2025-12-04T08:54:02.2180543Z * [new branch] compatiblpy39util -> 
origin/compatiblpy39util 2025-12-04T08:54:02.2181122Z * [new branch] cond_hop_device -> origin/cond_hop_device 2025-12-04T08:54:02.2181671Z * [new branch] context_test -> origin/context_test 2025-12-04T08:54:02.2182422Z * [new branch] copilot/code-style-cleanup-python-pip -> origin/copilot/code-style-cleanup-python-pip 2025-12-04T08:54:02.2183202Z * [new branch] cpio/fix_new_ami_tests -> origin/cpio/fix_new_ami_tests 2025-12-04T08:54:02.2183909Z * [new branch] cpp-docs-dependency-upgrade -> origin/cpp-docs-dependency-upgrade 2025-12-04T08:54:02.2184609Z * [new branch] csl/always_produce_xml -> origin/csl/always_produce_xml 2025-12-04T08:54:02.2185583Z * [new branch] csl/build_test_more_procs -> origin/csl/build_test_more_procs 2025-12-04T08:54:02.2186341Z * [new branch] csl/build_test_more_procs2 -> origin/csl/build_test_more_procs2 2025-12-04T08:54:02.2186947Z * [new branch] csl/clean_up -> origin/csl/clean_up 2025-12-04T08:54:02.2187559Z * [new branch] csl/fix_retry_segfault_exit -> origin/csl/fix_retry_segfault_exit 2025-12-04T08:54:02.2188156Z * [new branch] csl/katex -> origin/csl/katex 2025-12-04T08:54:02.2188708Z * [new branch] csl/larger_runner -> origin/csl/larger_runner 2025-12-04T08:54:02.2189271Z * [new branch] csl/lint_testing -> origin/csl/lint_testing 2025-12-04T08:54:02.2189818Z * [new branch] csl/lint_thing -> origin/csl/lint_thing 2025-12-04T08:54:02.2190403Z * [new branch] csl/lintrunner_stuff -> origin/csl/lintrunner_stuff 2025-12-04T08:54:02.2191016Z * [new branch] csl/manually_gen_json -> origin/csl/manually_gen_json 2025-12-04T08:54:02.2191591Z * [new branch] csl/mps_sharding -> origin/csl/mps_sharding 2025-12-04T08:54:02.2192177Z * [new branch] csl/multistage_docker -> origin/csl/multistage_docker 2025-12-04T08:54:02.2192761Z * [new branch] csl/print_timing -> origin/csl/print_timing 2025-12-04T08:54:02.2193338Z * [new branch] csl/remove_experiment -> origin/csl/remove_experiment 2025-12-04T08:54:02.2194101Z * [new branch] 
csl/remove_maybe_unused_var -> origin/csl/remove_maybe_unused_var 2025-12-04T08:54:02.2194846Z * [new branch] csl/remove_repo_specific_autolabel -> origin/csl/remove_repo_specific_autolabel 2025-12-04T08:54:02.2195561Z * [new branch] csl/remove_run_parallel -> origin/csl/remove_run_parallel 2025-12-04T08:54:02.2196268Z * [new branch] csl/remove_unused_vars -> origin/csl/remove_unused_vars 2025-12-04T08:54:02.2196847Z * [new branch] csl/revert_open -> origin/csl/revert_open 2025-12-04T08:54:02.2197396Z * [new branch] csl/skip_build -> origin/csl/skip_build 2025-12-04T08:54:02.2198017Z * [new branch] csl/smaller_avx_amx_runenrs -> origin/csl/smaller_avx_amx_runenrs 2025-12-04T08:54:02.2198623Z * [new branch] csl/td_job_level -> origin/csl/td_job_level 2025-12-04T08:54:02.2199291Z * [new branch] csl/test_cuda_build_large_runner -> origin/csl/test_cuda_build_large_runner 2025-12-04T08:54:02.2200095Z * [new branch] csl/test_owners_autograd_dispatch_nn -> origin/csl/test_owners_autograd_dispatch_nn 2025-12-04T08:54:02.2200906Z * [new branch] csl/test_owners_higher_confidence -> origin/csl/test_owners_higher_confidence 2025-12-04T08:54:02.2201703Z * [new branch] csl/upload_json_running -> origin/csl/upload_json_running 2025-12-04T08:54:02.2202297Z * [new branch] csl/win_sccache -> origin/csl/win_sccache 2025-12-04T08:54:02.2202839Z * [new branch] csl/xml_stuff -> origin/csl/xml_stuff 2025-12-04T08:54:02.2203380Z * [new branch] cublasrelax2 -> origin/cublasrelax2 2025-12-04T08:54:02.2203923Z * [new branch] cuda_mempool -> origin/cuda_mempool 2025-12-04T08:54:02.2204485Z * [new branch] custom_lowering_dict -> origin/custom_lowering_dict 2025-12-04T08:54:02.2205127Z * [new branch] d4l3k/debug_plane_frtrace -> origin/d4l3k/debug_plane_frtrace 2025-12-04T08:54:02.2205726Z * [new branch] daxia6/2.8o3 -> origin/daxia6/2.8o3 2025-12-04T08:54:02.2206344Z * [new branch] debug-guard -> origin/debug-guard 2025-12-04T08:54:02.2206923Z * [new branch] delete-quant-docs -> 
origin/delete-quant-docs 2025-12-04T08:54:02.2207978Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 2025-12-04T08:54:02.2209439Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 2025-12-04T08:54:02.2210522Z * [new branch] desertfire/test_cpp_wrapper -> origin/desertfire/test_cpp_wrapper 2025-12-04T08:54:02.2211304Z * [new branch] desertfire/triton-cpu-for-aarch64 -> origin/desertfire/triton-cpu-for-aarch64 2025-12-04T08:54:02.2212045Z * [new branch] dev/dhruva/flex_attn_opt -> origin/dev/dhruva/flex_attn_opt 2025-12-04T08:54:02.2212698Z * [new branch] dev/joona/MPSNDArrayAdd -> origin/dev/joona/MPSNDArrayAdd 2025-12-04T08:54:02.2213322Z * [new branch] dev/joona/Unranked -> origin/dev/joona/Unranked 2025-12-04T08:54:02.2213890Z * [new branch] dev/joona/cat -> origin/dev/joona/cat 2025-12-04T08:54:02.2214476Z * [new branch] dev/joona/embeddingbag -> origin/dev/joona/embeddingbag 2025-12-04T08:54:02.2215128Z * [new branch] dev/joona/fix_sdpa_memtest -> origin/dev/joona/fix_sdpa_memtest 2025-12-04T08:54:02.2215825Z * [new branch] dev/joona/getTensorsString -> origin/dev/joona/getTensorsString 2025-12-04T08:54:02.2216625Z * [new branch] dev/joona/mps_linear_macos14 -> origin/dev/joona/mps_linear_macos14 2025-12-04T08:54:02.2217395Z * [new branch] dev/joona/scalar_clamp -> origin/dev/joona/scalar_clamp 2025-12-04T08:54:02.2217980Z * [new branch] dev/joona/sdpa -> origin/dev/joona/sdpa 2025-12-04T08:54:02.2218548Z * [new branch] dev/joona/sdpa_api -> origin/dev/joona/sdpa_api 2025-12-04T08:54:02.2219133Z * [new branch] dev/joona/type_inf -> origin/dev/joona/type_inf 2025-12-04T08:54:02.2219757Z * [new branch] dev/joona/ulpAssertClose -> origin/dev/joona/ulpAssertClose 2025-12-04T08:54:02.2220385Z * [new branch] dev/joona/upsize3d -> 
origin/dev/joona/upsize3d 2025-12-04T08:54:02.2220942Z * [new branch] disp_counter -> origin/disp_counter 2025-12-04T08:54:02.2221504Z * [new branch] divyanshk-patch-1 -> origin/divyanshk-patch-1 2025-12-04T08:54:02.2222059Z * [new branch] docs -> origin/docs 2025-12-04T08:54:02.2222581Z * [new branch] documentation -> origin/documentation 2025-12-04T08:54:02.2223162Z * [new branch] eager_model_benchmarks -> origin/eager_model_benchmarks 2025-12-04T08:54:02.2223835Z * [new branch] embg/test_inductor_ci_control -> origin/embg/test_inductor_ci_control 2025-12-04T08:54:02.2224635Z * [new branch] embg/triton_l2_prefetch_128B -> origin/embg/triton_l2_prefetch_128B 2025-12-04T08:54:02.2225332Z * [new branch] embg/triton_l2_prefetch_256B -> origin/embg/triton_l2_prefetch_256B 2025-12-04T08:54:02.2226022Z * [new branch] eqy-patch-1 -> origin/eqy-patch-1 2025-12-04T08:54:02.2226563Z * [new branch] eqy-patch-2 -> origin/eqy-patch-2 2025-12-04T08:54:02.2227090Z * [new branch] eqy-patch-3 -> origin/eqy-patch-3 2025-12-04T08:54:02.2227610Z * [new branch] eqy-patch-4 -> origin/eqy-patch-4 2025-12-04T08:54:02.2228129Z * [new branch] eqy-patch-5 -> origin/eqy-patch-5 2025-12-04T08:54:02.2228650Z * [new branch] eqy-patch-6 -> origin/eqy-patch-6 2025-12-04T08:54:02.2229219Z * [new branch] exclamaforte/amd-ma -> origin/exclamaforte/amd-ma 2025-12-04T08:54:02.2229966Z * [new branch] exclamaforte/combo-kernels-perf-run -> origin/exclamaforte/combo-kernels-perf-run 2025-12-04T08:54:02.2230791Z * [new branch] exclamaforte/do_bench_refactor -> origin/exclamaforte/do_bench_refactor 2025-12-04T08:54:02.2231599Z * [new branch] exclamaforte/enable-mem-dep-fusion -> origin/exclamaforte/enable-mem-dep-fusion 2025-12-04T08:54:02.2232500Z * [new branch] exclamaforte/fix-exhaustive-autotuning -> origin/exclamaforte/fix-exhaustive-autotuning 2025-12-04T08:54:02.2233429Z * [new branch] exclamaforte/fix-trace-parsing-fx-svg -> origin/exclamaforte/fix-trace-parsing-fx-svg 2025-12-04T08:54:02.2234399Z 
* [new branch] exclamaforte/force-pointwise-cat-perf-run -> origin/exclamaforte/force-pointwise-cat-perf-run 2025-12-04T08:54:02.2235234Z * [new branch] exclamaforte/fusion-data -> origin/exclamaforte/fusion-data 2025-12-04T08:54:02.2236032Z * [new branch] exclamaforte/gemm-benchmark-run -> origin/exclamaforte/gemm-benchmark-run 2025-12-04T08:54:02.2236836Z * [new branch] exclamaforte/gemm-export-model -> origin/exclamaforte/gemm-export-model 2025-12-04T08:54:02.2237548Z * [new branch] exclamaforte/gemm-model -> origin/exclamaforte/gemm-model 2025-12-04T08:54:02.2238400Z * [new branch] exclamaforte/gemm-model-all-data-collection -> origin/exclamaforte/gemm-model-all-data-collection 2025-12-04T08:54:02.2239248Z * [new branch] exclamaforte/gemm-to-amd -> origin/exclamaforte/gemm-to-amd 2025-12-04T08:54:02.2240063Z * [new branch] exclamaforte/just-gemm-model -> origin/exclamaforte/just-gemm-model 2025-12-04T08:54:02.2240916Z * [new branch] exclamaforte/just-gemm-model-no-refactor -> origin/exclamaforte/just-gemm-model-no-refactor 2025-12-04T08:54:02.2241784Z * [new branch] exclamaforte/profile-diff-algo -> origin/exclamaforte/profile-diff-algo 2025-12-04T08:54:02.2242624Z * [new branch] exclamaforte/profiler-visualization -> origin/exclamaforte/profiler-visualization 2025-12-04T08:54:02.2243469Z * [new branch] exclamaforte/test_cpp_wrapper_mode -> origin/exclamaforte/test_cpp_wrapper_mode 2025-12-04T08:54:02.2244330Z * [new branch] exclamaforte/update-autotune-configs -> origin/exclamaforte/update-autotune-configs 2025-12-04T08:54:02.2245237Z * [new branch] exclamaforte/update-autotune-configs-2 -> origin/exclamaforte/update-autotune-configs-2 2025-12-04T08:54:02.2246027Z * [new branch] exec -> origin/exec 2025-12-04T08:54:02.2246593Z * [new branch] experimental-mosaic -> origin/experimental-mosaic 2025-12-04T08:54:02.2247186Z * [new branch] export-D61047529 -> origin/export-D61047529 2025-12-04T08:54:02.2247756Z * [new branch] export-D71412006 -> 
origin/export-D71412006 2025-12-04T08:54:02.2248401Z * [new branch] export-D73042989 -> origin/export-D73042989 2025-12-04T08:54:02.2249075Z * [new branch] export-D78957093 -> origin/export-D78957093 2025-12-04T08:54:02.2249629Z * [new branch] export-D78996107 -> origin/export-D78996107 2025-12-04T08:54:02.2250178Z * [new branch] export-D80823877 -> origin/export-D80823877 2025-12-04T08:54:02.2250724Z * [new branch] export-D80958642 -> origin/export-D80958642 2025-12-04T08:54:02.2251271Z * [new branch] export-D81054193 -> origin/export-D81054193 2025-12-04T08:54:02.2251822Z * [new branch] export-D81204584 -> origin/export-D81204584 2025-12-04T08:54:02.2252369Z * [new branch] export-D81429090 -> origin/export-D81429090 2025-12-04T08:54:02.2252910Z * [new branch] export-D82250826 -> origin/export-D82250826 2025-12-04T08:54:02.2253451Z * [new branch] export-D82253817 -> origin/export-D82253817 2025-12-04T08:54:02.2253996Z * [new branch] export-D83541846 -> origin/export-D83541846 2025-12-04T08:54:02.2254542Z * [new branch] export-D83627170 -> origin/export-D83627170 2025-12-04T08:54:02.2255081Z * [new branch] export-D83766701 -> origin/export-D83766701 2025-12-04T08:54:02.2255626Z * [new branch] export-D83768878 -> origin/export-D83768878 2025-12-04T08:54:02.2256229Z * [new branch] export-D83769447 -> origin/export-D83769447 2025-12-04T08:54:02.2256784Z * [new branch] export-D84089824 -> origin/export-D84089824 2025-12-04T08:54:02.2257328Z * [new branch] export-D84213020 -> origin/export-D84213020 2025-12-04T08:54:02.2257868Z * [new branch] export-D84373821 -> origin/export-D84373821 2025-12-04T08:54:02.2258403Z * [new branch] export-D84612194 -> origin/export-D84612194 2025-12-04T08:54:02.2258956Z * [new branch] export-D84890985 -> origin/export-D84890985 2025-12-04T08:54:02.2259502Z * [new branch] export-D85122326 -> origin/export-D85122326 2025-12-04T08:54:02.2260039Z * [new branch] export-D86256198 -> origin/export-D86256198 2025-12-04T08:54:02.2260578Z * [new 
branch] export-D86460608 -> origin/export-D86460608 2025-12-04T08:54:02.2261125Z * [new branch] export-D86474796 -> origin/export-D86474796 2025-12-04T08:54:02.2261662Z * [new branch] export-D86712396 -> origin/export-D86712396 2025-12-04T08:54:02.2262309Z * [new branch] export-D87022129 -> origin/export-D87022129 2025-12-04T08:54:02.2262846Z * [new branch] export-D87838959 -> origin/export-D87838959 2025-12-04T08:54:02.2263388Z * [new branch] export-D88319437 -> origin/export-D88319437 2025-12-04T08:54:02.2264092Z * [new branch] exported-model-train-idempotent -> origin/exported-model-train-idempotent 2025-12-04T08:54:02.2264829Z * [new branch] ezyang-titan-october -> origin/ezyang-titan-october 2025-12-04T08:54:02.2265463Z * [new branch] ezyang-titan-october2 -> origin/ezyang-titan-october2 2025-12-04T08:54:02.2266113Z * [new branch] ezyang-war -> origin/ezyang-war 2025-12-04T08:54:02.2266734Z * [new branch] ezyang/wip-aot-descriptors -> origin/ezyang/wip-aot-descriptors 2025-12-04T08:54:02.2267369Z * [new branch] fa_u8_brgemm -> origin/fa_u8_brgemm 2025-12-04T08:54:02.2267976Z * [new branch] fadeputr/sequence_fbgemm -> origin/fadeputr/sequence_fbgemm 2025-12-04T08:54:02.2268590Z * [new branch] fastmath_baseline -> origin/fastmath_baseline 2025-12-04T08:54:02.2269143Z * [new branch] fbcode/warm -> origin/fbcode/warm 2025-12-04T08:54:02.2269761Z * [new branch] fca -> origin/fca 2025-12-04T08:54:02.2270265Z * [new branch] fca2_ca5984c -> origin/fca2_ca5984c 2025-12-04T08:54:02.2270774Z * [new branch] fca5 -> origin/fca5 2025-12-04T08:54:02.2271332Z * [new branch] feature/justknobs-cpp -> origin/feature/justknobs-cpp 2025-12-04T08:54:02.2271964Z * [new branch] feature/numa-forkserver -> origin/feature/numa-forkserver 2025-12-04T08:54:02.2272577Z * [new branch] ffast_math_baseline -> origin/ffast_math_baseline 2025-12-04T08:54:02.2273154Z * [new branch] ffast_math_target -> origin/ffast_math_target 2025-12-04T08:54:02.2273732Z * [new branch] findhao/base_commit -> 
origin/findhao/base_commit 2025-12-04T08:54:02.2274325Z * [new branch] findhao/base_commit1 -> origin/findhao/base_commit1 2025-12-04T08:54:02.2274932Z * [new branch] findhao/multistream2 -> origin/findhao/multistream2 2025-12-04T08:54:02.2275539Z * [new branch] findhao/multistream5 -> origin/findhao/multistream5 2025-12-04T08:54:02.2276192Z * [new branch] findhao/multistream6 -> origin/findhao/multistream6 2025-12-04T08:54:02.2276810Z * [new branch] findhao/operatorbench3 -> origin/findhao/operatorbench3 2025-12-04T08:54:02.2277443Z * [new branch] findhao/operatorbench5 -> origin/findhao/operatorbench5 2025-12-04T08:54:02.2278063Z * [new branch] findhao/tritonparse -> origin/findhao/tritonparse 2025-12-04T08:54:02.2278739Z * [new branch] fix-ck-gemm-template-format -> origin/fix-ck-gemm-template-format 2025-12-04T08:54:02.2279413Z * [new branch] fix-config-ignore -> origin/fix-config-ignore 2025-12-04T08:54:02.2279992Z * [new branch] fix-dict-guard -> origin/fix-dict-guard 2025-12-04T08:54:02.2280547Z * [new branch] fix_addmm_issue -> origin/fix_addmm_issue 2025-12-04T08:54:02.2281166Z * [new branch] fix_amd_missing_cluster_dims -> origin/fix_amd_missing_cluster_dims 2025-12-04T08:54:02.2281802Z * [new branch] fix_bench_bwd_pass -> origin/fix_bench_bwd_pass 2025-12-04T08:54:02.2282402Z * [new branch] fix_mem_profiler_config -> origin/fix_mem_profiler_config 2025-12-04T08:54:02.2282997Z * [new branch] fix_nvrtc_discovery -> origin/fix_nvrtc_discovery 2025-12-04T08:54:02.2283557Z * [new branch] fix_op_runner -> origin/fix_op_runner 2025-12-04T08:54:02.2284189Z * [new branch] fix_ubn_159469 -> origin/fix_ubn_159469 2025-12-04T08:54:02.2284728Z * [new branch] fixes-triage -> origin/fixes-triage 2025-12-04T08:54:02.2285276Z * [new branch] fixflashinfer -> origin/fixflashinfer 2025-12-04T08:54:02.2285850Z * [new branch] flash_decoding_cpu -> origin/flash_decoding_cpu 2025-12-04T08:54:02.2286473Z * [new branch] flex-flash -> origin/flex-flash 2025-12-04T08:54:02.2287103Z 
* [new branch] flex_attention_functorch_grad -> origin/flex_attention_functorch_grad 2025-12-04T08:54:02.2287726Z * [new branch] flex_flash -> origin/flex_flash 2025-12-04T08:54:02.2288366Z * [new branch] fmassa/fix_memeff_sharding_rule -> origin/fmassa/fix_memeff_sharding_rule 2025-12-04T08:54:02.2289157Z * [new branch] fmassa/tests_comm_compute_scheduler -> origin/fmassa/tests_comm_compute_scheduler 2025-12-04T08:54:02.2289857Z * [new branch] forkserver_fix -> origin/forkserver_fix 2025-12-04T08:54:02.2290412Z * [new branch] fsdp2_trace_rules -> origin/fsdp2_trace_rules 2025-12-04T08:54:02.2290967Z * [new branch] fx_cpp -> origin/fx_cpp 2025-12-04T08:54:02.2291572Z * [new branch] fy/fix-win -> origin/fy/fix-win 2025-12-04T08:54:02.2292108Z * [new branch] galv-patch-1 -> origin/galv-patch-1 2025-12-04T08:54:02.2292872Z * [new branch] galv/cudagraphs-conditional-nodes-4 -> origin/galv/cudagraphs-conditional-nodes-4 2025-12-04T08:54:02.2293705Z * [new branch] georgehong/cmakelists-patch -> origin/georgehong/cmakelists-patch 2025-12-04T08:54:02.2294377Z * [new branch] gh/AlnisM/1/base -> origin/gh/AlnisM/1/base 2025-12-04T08:54:02.2294950Z * [new branch] gh/AlnisM/1/head -> origin/gh/AlnisM/1/head 2025-12-04T08:54:02.2295546Z * [new branch] gh/EikanWang/67/base -> origin/gh/EikanWang/67/base 2025-12-04T08:54:02.2296215Z * [new branch] gh/EikanWang/67/head -> origin/gh/EikanWang/67/head 2025-12-04T08:54:02.2296808Z * [new branch] gh/Gasoonjia/1/base -> origin/gh/Gasoonjia/1/base 2025-12-04T08:54:02.2297399Z * [new branch] gh/Gasoonjia/1/head -> origin/gh/Gasoonjia/1/head 2025-12-04T08:54:02.2297984Z * [new branch] gh/H-Huang/131/base -> origin/gh/H-Huang/131/base 2025-12-04T08:54:02.2298550Z * [new branch] gh/H-Huang/131/head -> origin/gh/H-Huang/131/head 2025-12-04T08:54:02.2299125Z * [new branch] gh/H-Huang/131/orig -> origin/gh/H-Huang/131/orig 2025-12-04T08:54:02.2299689Z * [new branch] gh/H-Huang/132/base -> origin/gh/H-Huang/132/base 2025-12-04T08:54:02.2300258Z 
* [new branch] gh/H-Huang/132/head -> origin/gh/H-Huang/132/head 2025-12-04T08:54:02.2300829Z * [new branch] gh/H-Huang/132/orig -> origin/gh/H-Huang/132/orig 2025-12-04T08:54:02.2301395Z * [new branch] gh/H-Huang/180/base -> origin/gh/H-Huang/180/base 2025-12-04T08:54:02.2301961Z * [new branch] gh/H-Huang/180/head -> origin/gh/H-Huang/180/head 2025-12-04T08:54:02.2302538Z * [new branch] gh/H-Huang/180/orig -> origin/gh/H-Huang/180/orig 2025-12-04T08:54:02.2303104Z * [new branch] gh/H-Huang/182/base -> origin/gh/H-Huang/182/base 2025-12-04T08:54:02.2303666Z * [new branch] gh/H-Huang/182/head -> origin/gh/H-Huang/182/head 2025-12-04T08:54:02.2304237Z * [new branch] gh/H-Huang/182/orig -> origin/gh/H-Huang/182/orig 2025-12-04T08:54:02.2304804Z * [new branch] gh/H-Huang/226/base -> origin/gh/H-Huang/226/base 2025-12-04T08:54:02.2305458Z * [new branch] gh/H-Huang/226/head -> origin/gh/H-Huang/226/head 2025-12-04T08:54:02.2306097Z * [new branch] gh/H-Huang/226/orig -> origin/gh/H-Huang/226/orig 2025-12-04T08:54:02.2306663Z * [new branch] gh/H-Huang/228/base -> origin/gh/H-Huang/228/base 2025-12-04T08:54:02.2307232Z * [new branch] gh/H-Huang/228/head -> origin/gh/H-Huang/228/head 2025-12-04T08:54:02.2307806Z * [new branch] gh/H-Huang/228/orig -> origin/gh/H-Huang/228/orig 2025-12-04T08:54:02.2308419Z * [new branch] gh/IvanKobzarev/150/base -> origin/gh/IvanKobzarev/150/base 2025-12-04T08:54:02.2309082Z * [new branch] gh/IvanKobzarev/150/head -> origin/gh/IvanKobzarev/150/head 2025-12-04T08:54:02.2309723Z * [new branch] gh/IvanKobzarev/150/orig -> origin/gh/IvanKobzarev/150/orig 2025-12-04T08:54:02.2310360Z * [new branch] gh/IvanKobzarev/157/base -> origin/gh/IvanKobzarev/157/base 2025-12-04T08:54:02.2311010Z * [new branch] gh/IvanKobzarev/157/head -> origin/gh/IvanKobzarev/157/head 2025-12-04T08:54:02.2311649Z * [new branch] gh/IvanKobzarev/157/orig -> origin/gh/IvanKobzarev/157/orig 2025-12-04T08:54:02.2312278Z * [new branch] gh/IvanKobzarev/159/base -> 
origin/gh/IvanKobzarev/159/base 2025-12-04T08:54:02.2313003Z * [new branch] gh/IvanKobzarev/159/head -> origin/gh/IvanKobzarev/159/head 2025-12-04T08:54:02.2313644Z * [new branch] gh/IvanKobzarev/159/orig -> origin/gh/IvanKobzarev/159/orig 2025-12-04T08:54:02.2314271Z * [new branch] gh/IvanKobzarev/162/base -> origin/gh/IvanKobzarev/162/base 2025-12-04T08:54:02.2314900Z * [new branch] gh/IvanKobzarev/162/head -> origin/gh/IvanKobzarev/162/head 2025-12-04T08:54:02.2315535Z * [new branch] gh/IvanKobzarev/162/orig -> origin/gh/IvanKobzarev/162/orig 2025-12-04T08:54:02.2316214Z * [new branch] gh/IvanKobzarev/163/base -> origin/gh/IvanKobzarev/163/base 2025-12-04T08:54:02.2316851Z * [new branch] gh/IvanKobzarev/163/head -> origin/gh/IvanKobzarev/163/head 2025-12-04T08:54:02.2317477Z * [new branch] gh/IvanKobzarev/163/orig -> origin/gh/IvanKobzarev/163/orig 2025-12-04T08:54:02.2318107Z * [new branch] gh/IvanKobzarev/166/base -> origin/gh/IvanKobzarev/166/base 2025-12-04T08:54:02.2318742Z * [new branch] gh/IvanKobzarev/166/head -> origin/gh/IvanKobzarev/166/head 2025-12-04T08:54:02.2319370Z * [new branch] gh/IvanKobzarev/166/orig -> origin/gh/IvanKobzarev/166/orig 2025-12-04T08:54:02.2320003Z * [new branch] gh/IvanKobzarev/167/base -> origin/gh/IvanKobzarev/167/base 2025-12-04T08:54:02.2320632Z * [new branch] gh/IvanKobzarev/167/head -> origin/gh/IvanKobzarev/167/head 2025-12-04T08:54:02.2321260Z * [new branch] gh/IvanKobzarev/167/orig -> origin/gh/IvanKobzarev/167/orig 2025-12-04T08:54:02.2321901Z * [new branch] gh/IvanKobzarev/168/base -> origin/gh/IvanKobzarev/168/base 2025-12-04T08:54:02.2322529Z * [new branch] gh/IvanKobzarev/168/head -> origin/gh/IvanKobzarev/168/head 2025-12-04T08:54:02.2323160Z * [new branch] gh/IvanKobzarev/168/orig -> origin/gh/IvanKobzarev/168/orig 2025-12-04T08:54:02.2323795Z * [new branch] gh/IvanKobzarev/169/base -> origin/gh/IvanKobzarev/169/base 2025-12-04T08:54:02.2324430Z * [new branch] gh/IvanKobzarev/169/head -> 
origin/gh/IvanKobzarev/169/head 2025-12-04T08:54:02.2325059Z * [new branch] gh/IvanKobzarev/169/orig -> origin/gh/IvanKobzarev/169/orig 2025-12-04T08:54:02.2325690Z * [new branch] gh/IvanKobzarev/170/base -> origin/gh/IvanKobzarev/170/base 2025-12-04T08:54:02.2326393Z * [new branch] gh/IvanKobzarev/170/head -> origin/gh/IvanKobzarev/170/head 2025-12-04T08:54:02.2327026Z * [new branch] gh/IvanKobzarev/170/orig -> origin/gh/IvanKobzarev/170/orig 2025-12-04T08:54:02.2327774Z * [new branch] gh/IvanKobzarev/171/base -> origin/gh/IvanKobzarev/171/base 2025-12-04T08:54:02.2328407Z * [new branch] gh/IvanKobzarev/171/head -> origin/gh/IvanKobzarev/171/head 2025-12-04T08:54:02.2329031Z * [new branch] gh/IvanKobzarev/171/orig -> origin/gh/IvanKobzarev/171/orig 2025-12-04T08:54:02.2329665Z * [new branch] gh/IvanKobzarev/172/base -> origin/gh/IvanKobzarev/172/base 2025-12-04T08:54:02.2330297Z * [new branch] gh/IvanKobzarev/172/head -> origin/gh/IvanKobzarev/172/head 2025-12-04T08:54:02.2330929Z * [new branch] gh/IvanKobzarev/172/orig -> origin/gh/IvanKobzarev/172/orig 2025-12-04T08:54:02.2331555Z * [new branch] gh/IvanKobzarev/173/base -> origin/gh/IvanKobzarev/173/base 2025-12-04T08:54:02.2332183Z * [new branch] gh/IvanKobzarev/173/head -> origin/gh/IvanKobzarev/173/head 2025-12-04T08:54:02.2332822Z * [new branch] gh/IvanKobzarev/173/orig -> origin/gh/IvanKobzarev/173/orig 2025-12-04T08:54:02.2333454Z * [new branch] gh/IvanKobzarev/174/base -> origin/gh/IvanKobzarev/174/base 2025-12-04T08:54:02.2334080Z * [new branch] gh/IvanKobzarev/174/head -> origin/gh/IvanKobzarev/174/head 2025-12-04T08:54:02.2334794Z * [new branch] gh/IvanKobzarev/174/orig -> origin/gh/IvanKobzarev/174/orig 2025-12-04T08:54:02.2335432Z * [new branch] gh/IvanKobzarev/175/base -> origin/gh/IvanKobzarev/175/base 2025-12-04T08:54:02.2336136Z * [new branch] gh/IvanKobzarev/175/head -> origin/gh/IvanKobzarev/175/head 2025-12-04T08:54:02.2336771Z * [new branch] gh/IvanKobzarev/175/orig -> 
origin/gh/IvanKobzarev/175/orig 2025-12-04T08:54:02.2337407Z * [new branch] gh/IvanKobzarev/176/base -> origin/gh/IvanKobzarev/176/base 2025-12-04T08:54:02.2338035Z * [new branch] gh/IvanKobzarev/176/head -> origin/gh/IvanKobzarev/176/head 2025-12-04T08:54:02.2338669Z * [new branch] gh/IvanKobzarev/176/orig -> origin/gh/IvanKobzarev/176/orig 2025-12-04T08:54:02.2339301Z * [new branch] gh/IvanKobzarev/177/base -> origin/gh/IvanKobzarev/177/base 2025-12-04T08:54:02.2339928Z * [new branch] gh/IvanKobzarev/177/head -> origin/gh/IvanKobzarev/177/head 2025-12-04T08:54:02.2340561Z * [new branch] gh/IvanKobzarev/177/orig -> origin/gh/IvanKobzarev/177/orig 2025-12-04T08:54:02.2341195Z * [new branch] gh/IvanKobzarev/178/base -> origin/gh/IvanKobzarev/178/base 2025-12-04T08:54:02.2341831Z * [new branch] gh/IvanKobzarev/178/head -> origin/gh/IvanKobzarev/178/head 2025-12-04T08:54:02.2342462Z * [new branch] gh/IvanKobzarev/178/orig -> origin/gh/IvanKobzarev/178/orig 2025-12-04T08:54:02.2343094Z * [new branch] gh/IvanKobzarev/179/base -> origin/gh/IvanKobzarev/179/base 2025-12-04T08:54:02.2343735Z * [new branch] gh/IvanKobzarev/179/head -> origin/gh/IvanKobzarev/179/head 2025-12-04T08:54:02.2344372Z * [new branch] gh/IvanKobzarev/179/orig -> origin/gh/IvanKobzarev/179/orig 2025-12-04T08:54:02.2345002Z * [new branch] gh/IvanKobzarev/180/base -> origin/gh/IvanKobzarev/180/base 2025-12-04T08:54:02.2345636Z * [new branch] gh/IvanKobzarev/180/head -> origin/gh/IvanKobzarev/180/head 2025-12-04T08:54:02.2346335Z * [new branch] gh/IvanKobzarev/180/orig -> origin/gh/IvanKobzarev/180/orig 2025-12-04T08:54:02.2346968Z * [new branch] gh/IvanKobzarev/181/base -> origin/gh/IvanKobzarev/181/base 2025-12-04T08:54:02.2347602Z * [new branch] gh/IvanKobzarev/181/head -> origin/gh/IvanKobzarev/181/head 2025-12-04T08:54:02.2348236Z * [new branch] gh/IvanKobzarev/181/orig -> origin/gh/IvanKobzarev/181/orig 2025-12-04T08:54:02.2348869Z * [new branch] gh/IvanKobzarev/182/base -> 
origin/gh/IvanKobzarev/182/base 2025-12-04T08:54:02.2349597Z * [new branch] gh/IvanKobzarev/182/head -> origin/gh/IvanKobzarev/182/head 2025-12-04T08:54:02.2350237Z * [new branch] gh/IvanKobzarev/182/orig -> origin/gh/IvanKobzarev/182/orig 2025-12-04T08:54:02.2350867Z * [new branch] gh/IvanKobzarev/183/base -> origin/gh/IvanKobzarev/183/base 2025-12-04T08:54:02.2351499Z * [new branch] gh/IvanKobzarev/183/head -> origin/gh/IvanKobzarev/183/head 2025-12-04T08:54:02.2352135Z * [new branch] gh/IvanKobzarev/183/orig -> origin/gh/IvanKobzarev/183/orig 2025-12-04T08:54:02.2352767Z * [new branch] gh/IvanKobzarev/184/base -> origin/gh/IvanKobzarev/184/base 2025-12-04T08:54:02.2353395Z * [new branch] gh/IvanKobzarev/184/head -> origin/gh/IvanKobzarev/184/head 2025-12-04T08:54:02.2354025Z * [new branch] gh/IvanKobzarev/184/orig -> origin/gh/IvanKobzarev/184/orig 2025-12-04T08:54:02.2354674Z * [new branch] gh/NikhilAPatel/1/base -> origin/gh/NikhilAPatel/1/base 2025-12-04T08:54:02.2355307Z * [new branch] gh/NikhilAPatel/1/head -> origin/gh/NikhilAPatel/1/head 2025-12-04T08:54:02.2356001Z * [new branch] gh/NikhilAPatel/2/base -> origin/gh/NikhilAPatel/2/base 2025-12-04T08:54:02.2356627Z * [new branch] gh/NikhilAPatel/2/head -> origin/gh/NikhilAPatel/2/head 2025-12-04T08:54:02.2357326Z * [new branch] gh/NikhilAPatel/4/base -> origin/gh/NikhilAPatel/4/base 2025-12-04T08:54:02.2357946Z * [new branch] gh/NikhilAPatel/4/head -> origin/gh/NikhilAPatel/4/head 2025-12-04T08:54:02.2358566Z * [new branch] gh/NikhilAPatel/5/base -> origin/gh/NikhilAPatel/5/base 2025-12-04T08:54:02.2359178Z * [new branch] gh/NikhilAPatel/5/head -> origin/gh/NikhilAPatel/5/head 2025-12-04T08:54:02.2359794Z * [new branch] gh/NikhilAPatel/5/orig -> origin/gh/NikhilAPatel/5/orig 2025-12-04T08:54:02.2360405Z * [new branch] gh/PaliC/17/base -> origin/gh/PaliC/17/base 2025-12-04T08:54:02.2360969Z * [new branch] gh/PaliC/17/head -> origin/gh/PaliC/17/head 2025-12-04T08:54:02.2361532Z * [new branch] 
gh/PaliC/17/orig -> origin/gh/PaliC/17/orig 2025-12-04T08:54:02.2362088Z * [new branch] gh/PaliC/18/base -> origin/gh/PaliC/18/base 2025-12-04T08:54:02.2362648Z * [new branch] gh/PaliC/18/head -> origin/gh/PaliC/18/head 2025-12-04T08:54:02.2363198Z * [new branch] gh/PaliC/18/orig -> origin/gh/PaliC/18/orig 2025-12-04T08:54:02.2363742Z * [new branch] gh/PaliC/20/base -> origin/gh/PaliC/20/base 2025-12-04T08:54:02.2364286Z * [new branch] gh/PaliC/20/head -> origin/gh/PaliC/20/head 2025-12-04T08:54:02.2364838Z * [new branch] gh/PaliC/20/orig -> origin/gh/PaliC/20/orig 2025-12-04T08:54:02.2365383Z * [new branch] gh/PaliC/21/base -> origin/gh/PaliC/21/base 2025-12-04T08:54:02.2365997Z * [new branch] gh/PaliC/21/head -> origin/gh/PaliC/21/head 2025-12-04T08:54:02.2366542Z * [new branch] gh/PaliC/21/orig -> origin/gh/PaliC/21/orig 2025-12-04T08:54:02.2367087Z * [new branch] gh/PaliC/23/base -> origin/gh/PaliC/23/base 2025-12-04T08:54:02.2367636Z * [new branch] gh/PaliC/23/head -> origin/gh/PaliC/23/head 2025-12-04T08:54:02.2368184Z * [new branch] gh/PaliC/23/orig -> origin/gh/PaliC/23/orig 2025-12-04T08:54:02.2368725Z * [new branch] gh/PaliC/24/base -> origin/gh/PaliC/24/base 2025-12-04T08:54:02.2369277Z * [new branch] gh/PaliC/24/head -> origin/gh/PaliC/24/head 2025-12-04T08:54:02.2369821Z * [new branch] gh/PaliC/24/orig -> origin/gh/PaliC/24/orig 2025-12-04T08:54:02.2370464Z * [new branch] gh/PaliC/25/head -> origin/gh/PaliC/25/head 2025-12-04T08:54:02.2371010Z * [new branch] gh/PaliC/25/next -> origin/gh/PaliC/25/next 2025-12-04T08:54:02.2371562Z * [new branch] gh/PaliC/25/orig -> origin/gh/PaliC/25/orig 2025-12-04T08:54:02.2372101Z * [new branch] gh/PaliC/26/head -> origin/gh/PaliC/26/head 2025-12-04T08:54:02.2372649Z * [new branch] gh/PaliC/26/next -> origin/gh/PaliC/26/next 2025-12-04T08:54:02.2373201Z * [new branch] gh/PaliC/26/orig -> origin/gh/PaliC/26/orig 2025-12-04T08:54:02.2373747Z * [new branch] gh/PaliC/27/next -> origin/gh/PaliC/27/next 
2025-12-04T08:54:02.2374288Z * [new branch] gh/PaliC/28/head -> origin/gh/PaliC/28/head 2025-12-04T08:54:02.2374831Z * [new branch] gh/PaliC/28/next -> origin/gh/PaliC/28/next 2025-12-04T08:54:02.2375376Z * [new branch] gh/PaliC/28/orig -> origin/gh/PaliC/28/orig 2025-12-04T08:54:02.2375999Z * [new branch] gh/PaliC/29/head -> origin/gh/PaliC/29/head 2025-12-04T08:54:02.2376541Z * [new branch] gh/PaliC/29/next -> origin/gh/PaliC/29/next 2025-12-04T08:54:02.2377087Z * [new branch] gh/PaliC/29/orig -> origin/gh/PaliC/29/orig 2025-12-04T08:54:02.2377720Z * [new branch] gh/PaliC/30/head -> origin/gh/PaliC/30/head 2025-12-04T08:54:02.2378272Z * [new branch] gh/PaliC/30/next -> origin/gh/PaliC/30/next 2025-12-04T08:54:02.2378825Z * [new branch] gh/PaliC/30/orig -> origin/gh/PaliC/30/orig 2025-12-04T08:54:02.2379373Z * [new branch] gh/PaliC/31/head -> origin/gh/PaliC/31/head 2025-12-04T08:54:02.2379915Z * [new branch] gh/PaliC/31/next -> origin/gh/PaliC/31/next 2025-12-04T08:54:02.2380469Z * [new branch] gh/PaliC/31/orig -> origin/gh/PaliC/31/orig 2025-12-04T08:54:02.2381060Z * [new branch] gh/PaulZhang12/25/base -> origin/gh/PaulZhang12/25/base 2025-12-04T08:54:02.2381678Z * [new branch] gh/PaulZhang12/25/head -> origin/gh/PaulZhang12/25/head 2025-12-04T08:54:02.2382289Z * [new branch] gh/PaulZhang12/25/orig -> origin/gh/PaulZhang12/25/orig 2025-12-04T08:54:02.2382901Z * [new branch] gh/PaulZhang12/28/base -> origin/gh/PaulZhang12/28/base 2025-12-04T08:54:02.2383501Z * [new branch] gh/PaulZhang12/28/head -> origin/gh/PaulZhang12/28/head 2025-12-04T08:54:02.2384107Z * [new branch] gh/PaulZhang12/28/orig -> origin/gh/PaulZhang12/28/orig 2025-12-04T08:54:02.2384713Z * [new branch] gh/PaulZhang12/31/base -> origin/gh/PaulZhang12/31/base 2025-12-04T08:54:02.2385309Z * [new branch] gh/PaulZhang12/31/head -> origin/gh/PaulZhang12/31/head 2025-12-04T08:54:02.2385911Z * [new branch] gh/PaulZhang12/31/orig -> origin/gh/PaulZhang12/31/orig 2025-12-04T08:54:02.2386587Z * [new branch] 
gh/PaulZhang12/37/base -> origin/gh/PaulZhang12/37/base 2025-12-04T08:54:02.2387381Z * [new branch] gh/PaulZhang12/37/head -> origin/gh/PaulZhang12/37/head 2025-12-04T08:54:02.2387988Z * [new branch] gh/PaulZhang12/37/orig -> origin/gh/PaulZhang12/37/orig 2025-12-04T08:54:02.2388591Z * [new branch] gh/PaulZhang12/40/base -> origin/gh/PaulZhang12/40/base 2025-12-04T08:54:02.2389199Z * [new branch] gh/PaulZhang12/40/head -> origin/gh/PaulZhang12/40/head 2025-12-04T08:54:02.2389806Z * [new branch] gh/PaulZhang12/40/orig -> origin/gh/PaulZhang12/40/orig 2025-12-04T08:54:02.2390403Z * [new branch] gh/PaulZhang12/42/base -> origin/gh/PaulZhang12/42/base 2025-12-04T08:54:02.2391003Z * [new branch] gh/PaulZhang12/42/head -> origin/gh/PaulZhang12/42/head 2025-12-04T08:54:02.2391726Z * [new branch] gh/PaulZhang12/43/base -> origin/gh/PaulZhang12/43/base 2025-12-04T08:54:02.2392322Z * [new branch] gh/PaulZhang12/43/head -> origin/gh/PaulZhang12/43/head 2025-12-04T08:54:02.2392928Z * [new branch] gh/PaulZhang12/43/orig -> origin/gh/PaulZhang12/43/orig 2025-12-04T08:54:02.2393536Z * [new branch] gh/PaulZhang12/44/base -> origin/gh/PaulZhang12/44/base 2025-12-04T08:54:02.2394141Z * [new branch] gh/PaulZhang12/44/head -> origin/gh/PaulZhang12/44/head 2025-12-04T08:54:02.2394744Z * [new branch] gh/PaulZhang12/45/base -> origin/gh/PaulZhang12/45/base 2025-12-04T08:54:02.2395348Z * [new branch] gh/PaulZhang12/45/head -> origin/gh/PaulZhang12/45/head 2025-12-04T08:54:02.2396011Z * [new branch] gh/PaulZhang12/45/orig -> origin/gh/PaulZhang12/45/orig 2025-12-04T08:54:02.2396615Z * [new branch] gh/PaulZhang12/46/base -> origin/gh/PaulZhang12/46/base 2025-12-04T08:54:02.2397227Z * [new branch] gh/PaulZhang12/46/head -> origin/gh/PaulZhang12/46/head 2025-12-04T08:54:02.2397841Z * [new branch] gh/PaulZhang12/46/orig -> origin/gh/PaulZhang12/46/orig 2025-12-04T08:54:02.2398456Z * [new branch] gh/PaulZhang12/47/base -> origin/gh/PaulZhang12/47/base 2025-12-04T08:54:02.2399063Z * [new branch] 
gh/PaulZhang12/47/head -> origin/gh/PaulZhang12/47/head 2025-12-04T08:54:02.2399776Z * [new branch] gh/PaulZhang12/47/orig -> origin/gh/PaulZhang12/47/orig 2025-12-04T08:54:02.2400393Z * [new branch] gh/PaulZhang12/48/base -> origin/gh/PaulZhang12/48/base 2025-12-04T08:54:02.2400999Z * [new branch] gh/PaulZhang12/48/head -> origin/gh/PaulZhang12/48/head 2025-12-04T08:54:02.2401601Z * [new branch] gh/PaulZhang12/48/orig -> origin/gh/PaulZhang12/48/orig 2025-12-04T08:54:02.2402208Z * [new branch] gh/SamGinzburg/11/base -> origin/gh/SamGinzburg/11/base 2025-12-04T08:54:02.2402823Z * [new branch] gh/SamGinzburg/11/head -> origin/gh/SamGinzburg/11/head 2025-12-04T08:54:02.2403453Z * [new branch] gh/SherlockNoMad/1/base -> origin/gh/SherlockNoMad/1/base 2025-12-04T08:54:02.2404087Z * [new branch] gh/SherlockNoMad/1/head -> origin/gh/SherlockNoMad/1/head 2025-12-04T08:54:02.2404733Z * [new branch] gh/SherlockNoMad/10/base -> origin/gh/SherlockNoMad/10/base 2025-12-04T08:54:02.2405386Z * [new branch] gh/SherlockNoMad/10/head -> origin/gh/SherlockNoMad/10/head 2025-12-04T08:54:02.2406085Z * [new branch] gh/SherlockNoMad/10/orig -> origin/gh/SherlockNoMad/10/orig 2025-12-04T08:54:02.2406724Z * [new branch] gh/SherlockNoMad/11/base -> origin/gh/SherlockNoMad/11/base 2025-12-04T08:54:02.2409101Z * [new branch] gh/SherlockNoMad/11/head -> origin/gh/SherlockNoMad/11/head 2025-12-04T08:54:02.2409915Z * [new branch] gh/SherlockNoMad/11/orig -> origin/gh/SherlockNoMad/11/orig 2025-12-04T08:54:02.2410634Z * [new branch] gh/SherlockNoMad/12/base -> origin/gh/SherlockNoMad/12/base 2025-12-04T08:54:02.2411429Z * [new branch] gh/SherlockNoMad/12/head -> origin/gh/SherlockNoMad/12/head 2025-12-04T08:54:02.2412183Z * [new branch] gh/SherlockNoMad/12/orig -> origin/gh/SherlockNoMad/12/orig 2025-12-04T08:54:02.2412903Z * [new branch] gh/SherlockNoMad/15/base -> origin/gh/SherlockNoMad/15/base 2025-12-04T08:54:02.2413715Z * [new branch] gh/SherlockNoMad/15/head -> 
origin/gh/SherlockNoMad/15/head 2025-12-04T08:54:02.2414460Z * [new branch] gh/SherlockNoMad/15/orig -> origin/gh/SherlockNoMad/15/orig 2025-12-04T08:54:02.2415181Z * [new branch] gh/SherlockNoMad/17/base -> origin/gh/SherlockNoMad/17/base 2025-12-04T08:54:02.2416092Z * [new branch] gh/SherlockNoMad/17/head -> origin/gh/SherlockNoMad/17/head 2025-12-04T08:54:02.2416921Z * [new branch] gh/SherlockNoMad/17/orig -> origin/gh/SherlockNoMad/17/orig 2025-12-04T08:54:02.2417709Z * [new branch] gh/SherlockNoMad/18/base -> origin/gh/SherlockNoMad/18/base 2025-12-04T08:54:02.2418461Z * [new branch] gh/SherlockNoMad/18/head -> origin/gh/SherlockNoMad/18/head 2025-12-04T08:54:02.2419185Z * [new branch] gh/SherlockNoMad/18/orig -> origin/gh/SherlockNoMad/18/orig 2025-12-04T08:54:02.2420004Z * [new branch] gh/SherlockNoMad/19/base -> origin/gh/SherlockNoMad/19/base 2025-12-04T08:54:02.2420736Z * [new branch] gh/SherlockNoMad/19/head -> origin/gh/SherlockNoMad/19/head 2025-12-04T08:54:02.2421503Z * [new branch] gh/SherlockNoMad/19/orig -> origin/gh/SherlockNoMad/19/orig 2025-12-04T08:54:02.2422310Z * [new branch] gh/SherlockNoMad/2/base -> origin/gh/SherlockNoMad/2/base 2025-12-04T08:54:02.2456259Z * [new branch] gh/SherlockNoMad/2/head -> origin/gh/SherlockNoMad/2/head 2025-12-04T08:54:02.2456770Z * [new branch] gh/SherlockNoMad/20/base -> origin/gh/SherlockNoMad/20/base 2025-12-04T08:54:02.2457205Z * [new branch] gh/SherlockNoMad/20/head -> origin/gh/SherlockNoMad/20/head 2025-12-04T08:54:02.2457653Z * [new branch] gh/SherlockNoMad/20/orig -> origin/gh/SherlockNoMad/20/orig 2025-12-04T08:54:02.2458226Z * [new branch] gh/SherlockNoMad/21/base -> origin/gh/SherlockNoMad/21/base 2025-12-04T08:54:02.2458638Z * [new branch] gh/SherlockNoMad/21/head -> origin/gh/SherlockNoMad/21/head 2025-12-04T08:54:02.2459038Z * [new branch] gh/SherlockNoMad/21/orig -> origin/gh/SherlockNoMad/21/orig 2025-12-04T08:54:02.2459447Z * [new branch] gh/SherlockNoMad/3/base -> 
origin/gh/SherlockNoMad/3/base 2025-12-04T08:54:02.2459851Z * [new branch] gh/SherlockNoMad/3/head -> origin/gh/SherlockNoMad/3/head 2025-12-04T08:54:02.2460249Z * [new branch] gh/SherlockNoMad/4/base -> origin/gh/SherlockNoMad/4/base 2025-12-04T08:54:02.2460646Z * [new branch] gh/SherlockNoMad/4/head -> origin/gh/SherlockNoMad/4/head 2025-12-04T08:54:02.2461043Z * [new branch] gh/SherlockNoMad/5/base -> origin/gh/SherlockNoMad/5/base 2025-12-04T08:54:02.2461433Z * [new branch] gh/SherlockNoMad/5/head -> origin/gh/SherlockNoMad/5/head 2025-12-04T08:54:02.2461868Z * [new branch] gh/Sidharth123-cpu/24/base -> origin/gh/Sidharth123-cpu/24/base 2025-12-04T08:54:02.2462313Z * [new branch] gh/Sidharth123-cpu/25/base -> origin/gh/Sidharth123-cpu/25/base 2025-12-04T08:54:02.2462741Z * [new branch] gh/Sidharth123-cpu/26/base -> origin/gh/Sidharth123-cpu/26/base 2025-12-04T08:54:02.2463169Z * [new branch] gh/Sidharth123-cpu/27/base -> origin/gh/Sidharth123-cpu/27/base 2025-12-04T08:54:02.2463588Z * [new branch] gh/StrongerXi/1/base -> origin/gh/StrongerXi/1/base 2025-12-04T08:54:02.2463980Z * [new branch] gh/StrongerXi/1/head -> origin/gh/StrongerXi/1/head 2025-12-04T08:54:02.2464375Z * [new branch] gh/StrongerXi/71/base -> origin/gh/StrongerXi/71/base 2025-12-04T08:54:02.2464763Z * [new branch] gh/StrongerXi/71/head -> origin/gh/StrongerXi/71/head 2025-12-04T08:54:02.2465149Z * [new branch] gh/StrongerXi/72/base -> origin/gh/StrongerXi/72/base 2025-12-04T08:54:02.2465534Z * [new branch] gh/StrongerXi/72/head -> origin/gh/StrongerXi/72/head 2025-12-04T08:54:02.2465916Z * [new branch] gh/StrongerXi/73/base -> origin/gh/StrongerXi/73/base 2025-12-04T08:54:02.2466352Z * [new branch] gh/StrongerXi/73/head -> origin/gh/StrongerXi/73/head 2025-12-04T08:54:02.2466733Z * [new branch] gh/StrongerXi/73/orig -> origin/gh/StrongerXi/73/orig 2025-12-04T08:54:02.2467116Z * [new branch] gh/XilunWu/160/base -> origin/gh/XilunWu/160/base 2025-12-04T08:54:02.2467561Z * [new branch] 
gh/XilunWu/160/head -> origin/gh/XilunWu/160/head 2025-12-04T08:54:02.2467932Z * [new branch] gh/XilunWu/160/orig -> origin/gh/XilunWu/160/orig 2025-12-04T08:54:02.2468296Z * [new branch] gh/XilunWu/163/base -> origin/gh/XilunWu/163/base 2025-12-04T08:54:02.2468669Z * [new branch] gh/XilunWu/163/head -> origin/gh/XilunWu/163/head 2025-12-04T08:54:02.2469038Z * [new branch] gh/XilunWu/163/orig -> origin/gh/XilunWu/163/orig 2025-12-04T08:54:02.2469401Z * [new branch] gh/XilunWu/168/base -> origin/gh/XilunWu/168/base 2025-12-04T08:54:02.2469767Z * [new branch] gh/XilunWu/168/head -> origin/gh/XilunWu/168/head 2025-12-04T08:54:02.2470136Z * [new branch] gh/XilunWu/168/orig -> origin/gh/XilunWu/168/orig 2025-12-04T08:54:02.2470495Z * [new branch] gh/XilunWu/169/base -> origin/gh/XilunWu/169/base 2025-12-04T08:54:02.2470873Z * [new branch] gh/XilunWu/169/head -> origin/gh/XilunWu/169/head 2025-12-04T08:54:02.2471233Z * [new branch] gh/XilunWu/169/orig -> origin/gh/XilunWu/169/orig 2025-12-04T08:54:02.2471595Z * [new branch] gh/XilunWu/170/base -> origin/gh/XilunWu/170/base 2025-12-04T08:54:02.2472023Z * [new branch] gh/XilunWu/170/head -> origin/gh/XilunWu/170/head 2025-12-04T08:54:02.2472386Z * [new branch] gh/XilunWu/170/orig -> origin/gh/XilunWu/170/orig 2025-12-04T08:54:02.2472751Z * [new branch] gh/XilunWu/171/base -> origin/gh/XilunWu/171/base 2025-12-04T08:54:02.2473113Z * [new branch] gh/XilunWu/171/head -> origin/gh/XilunWu/171/head 2025-12-04T08:54:02.2473471Z * [new branch] gh/XilunWu/171/orig -> origin/gh/XilunWu/171/orig 2025-12-04T08:54:02.2473832Z * [new branch] gh/XilunWu/173/base -> origin/gh/XilunWu/173/base 2025-12-04T08:54:02.2474197Z * [new branch] gh/XilunWu/173/head -> origin/gh/XilunWu/173/head 2025-12-04T08:54:02.2474556Z * [new branch] gh/XilunWu/173/orig -> origin/gh/XilunWu/173/orig 2025-12-04T08:54:02.2474915Z * [new branch] gh/XilunWu/175/base -> origin/gh/XilunWu/175/base 2025-12-04T08:54:02.2475281Z * [new branch] gh/XilunWu/175/head -> 
origin/gh/XilunWu/175/head 2025-12-04T08:54:02.2475640Z * [new branch] gh/XilunWu/175/orig -> origin/gh/XilunWu/175/orig 2025-12-04T08:54:02.2476055Z * [new branch] gh/XilunWu/176/base -> origin/gh/XilunWu/176/base 2025-12-04T08:54:02.2476415Z * [new branch] gh/XilunWu/176/head -> origin/gh/XilunWu/176/head 2025-12-04T08:54:02.2476780Z * [new branch] gh/XilunWu/176/orig -> origin/gh/XilunWu/176/orig 2025-12-04T08:54:02.2477151Z * [new branch] gh/XuehaiPan/14/base -> origin/gh/XuehaiPan/14/base 2025-12-04T08:54:02.2477532Z * [new branch] gh/XuehaiPan/14/head -> origin/gh/XuehaiPan/14/head 2025-12-04T08:54:02.2477909Z * [new branch] gh/XuehaiPan/14/orig -> origin/gh/XuehaiPan/14/orig 2025-12-04T08:54:02.2478293Z * [new branch] gh/XuehaiPan/179/base -> origin/gh/XuehaiPan/179/base 2025-12-04T08:54:02.2478681Z * [new branch] gh/XuehaiPan/179/head -> origin/gh/XuehaiPan/179/head 2025-12-04T08:54:02.2479061Z * [new branch] gh/XuehaiPan/179/orig -> origin/gh/XuehaiPan/179/orig 2025-12-04T08:54:02.2479439Z * [new branch] gh/XuehaiPan/249/base -> origin/gh/XuehaiPan/249/base 2025-12-04T08:54:02.2479814Z * [new branch] gh/XuehaiPan/249/head -> origin/gh/XuehaiPan/249/head 2025-12-04T08:54:02.2480191Z * [new branch] gh/XuehaiPan/249/orig -> origin/gh/XuehaiPan/249/orig 2025-12-04T08:54:02.2480647Z * [new branch] gh/XuehaiPan/253/base -> origin/gh/XuehaiPan/253/base 2025-12-04T08:54:02.2481024Z * [new branch] gh/XuehaiPan/253/head -> origin/gh/XuehaiPan/253/head 2025-12-04T08:54:02.2481400Z * [new branch] gh/XuehaiPan/253/orig -> origin/gh/XuehaiPan/253/orig 2025-12-04T08:54:02.2481776Z * [new branch] gh/XuehaiPan/254/base -> origin/gh/XuehaiPan/254/base 2025-12-04T08:54:02.2482157Z * [new branch] gh/XuehaiPan/254/head -> origin/gh/XuehaiPan/254/head 2025-12-04T08:54:02.2482534Z * [new branch] gh/XuehaiPan/254/orig -> origin/gh/XuehaiPan/254/orig 2025-12-04T08:54:02.2482911Z * [new branch] gh/XuehaiPan/255/base -> origin/gh/XuehaiPan/255/base 2025-12-04T08:54:02.2483287Z * 
[new branch] gh/XuehaiPan/255/head -> origin/gh/XuehaiPan/255/head 2025-12-04T08:54:02.2483662Z * [new branch] gh/XuehaiPan/255/orig -> origin/gh/XuehaiPan/255/orig 2025-12-04T08:54:02.2484045Z * [new branch] gh/XuehaiPan/271/base -> origin/gh/XuehaiPan/271/base 2025-12-04T08:54:02.2484424Z * [new branch] gh/XuehaiPan/271/head -> origin/gh/XuehaiPan/271/head 2025-12-04T08:54:02.2484800Z * [new branch] gh/XuehaiPan/271/orig -> origin/gh/XuehaiPan/271/orig 2025-12-04T08:54:02.2485174Z * [new branch] gh/XuehaiPan/343/base -> origin/gh/XuehaiPan/343/base 2025-12-04T08:54:02.2485610Z * [new branch] gh/XuehaiPan/343/head -> origin/gh/XuehaiPan/343/head 2025-12-04T08:54:02.2486030Z * [new branch] gh/XuehaiPan/343/orig -> origin/gh/XuehaiPan/343/orig 2025-12-04T08:54:02.2486406Z * [new branch] gh/XuehaiPan/347/base -> origin/gh/XuehaiPan/347/base 2025-12-04T08:54:02.2486784Z * [new branch] gh/XuehaiPan/347/head -> origin/gh/XuehaiPan/347/head 2025-12-04T08:54:02.2487163Z * [new branch] gh/XuehaiPan/347/orig -> origin/gh/XuehaiPan/347/orig 2025-12-04T08:54:02.2487546Z * [new branch] gh/XuehaiPan/348/base -> origin/gh/XuehaiPan/348/base 2025-12-04T08:54:02.2487926Z * [new branch] gh/XuehaiPan/348/head -> origin/gh/XuehaiPan/348/head 2025-12-04T08:54:02.2488305Z * [new branch] gh/XuehaiPan/348/orig -> origin/gh/XuehaiPan/348/orig 2025-12-04T08:54:02.2488686Z * [new branch] gh/XuehaiPan/350/base -> origin/gh/XuehaiPan/350/base 2025-12-04T08:54:02.2489065Z * [new branch] gh/XuehaiPan/350/head -> origin/gh/XuehaiPan/350/head 2025-12-04T08:54:02.2489443Z * [new branch] gh/XuehaiPan/350/orig -> origin/gh/XuehaiPan/350/orig 2025-12-04T08:54:02.2489817Z * [new branch] gh/XuehaiPan/365/base -> origin/gh/XuehaiPan/365/base 2025-12-04T08:54:02.2490197Z * [new branch] gh/XuehaiPan/365/head -> origin/gh/XuehaiPan/365/head 2025-12-04T08:54:02.2490576Z * [new branch] gh/XuehaiPan/365/orig -> origin/gh/XuehaiPan/365/orig 2025-12-04T08:54:02.2490954Z * [new branch] gh/XuehaiPan/366/base -> 
origin/gh/XuehaiPan/366/base 2025-12-04T08:54:02.2491333Z * [new branch] gh/XuehaiPan/366/head -> origin/gh/XuehaiPan/366/head 2025-12-04T08:54:02.2491709Z * [new branch] gh/XuehaiPan/370/base -> origin/gh/XuehaiPan/370/base 2025-12-04T08:54:02.2492089Z * [new branch] gh/XuehaiPan/370/head -> origin/gh/XuehaiPan/370/head 2025-12-04T08:54:02.2492468Z * [new branch] gh/XuehaiPan/370/orig -> origin/gh/XuehaiPan/370/orig 2025-12-04T08:54:02.2492845Z * [new branch] gh/XuehaiPan/390/base -> origin/gh/XuehaiPan/390/base 2025-12-04T08:54:02.2493492Z * [new branch] gh/XuehaiPan/390/head -> origin/gh/XuehaiPan/390/head 2025-12-04T08:54:02.2493870Z * [new branch] gh/XuehaiPan/390/orig -> origin/gh/XuehaiPan/390/orig 2025-12-04T08:54:02.2494244Z * [new branch] gh/XuehaiPan/391/base -> origin/gh/XuehaiPan/391/base 2025-12-04T08:54:02.2494704Z * [new branch] gh/XuehaiPan/391/head -> origin/gh/XuehaiPan/391/head 2025-12-04T08:54:02.2495083Z * [new branch] gh/XuehaiPan/391/orig -> origin/gh/XuehaiPan/391/orig 2025-12-04T08:54:02.2495457Z * [new branch] gh/XuehaiPan/392/base -> origin/gh/XuehaiPan/392/base 2025-12-04T08:54:02.2495841Z * [new branch] gh/XuehaiPan/392/head -> origin/gh/XuehaiPan/392/head 2025-12-04T08:54:02.2496285Z * [new branch] gh/XuehaiPan/392/orig -> origin/gh/XuehaiPan/392/orig 2025-12-04T08:54:02.2496661Z * [new branch] gh/XuehaiPan/394/base -> origin/gh/XuehaiPan/394/base 2025-12-04T08:54:02.2497038Z * [new branch] gh/XuehaiPan/394/head -> origin/gh/XuehaiPan/394/head 2025-12-04T08:54:02.2497412Z * [new branch] gh/XuehaiPan/394/orig -> origin/gh/XuehaiPan/394/orig 2025-12-04T08:54:02.2497786Z * [new branch] gh/XuehaiPan/397/base -> origin/gh/XuehaiPan/397/base 2025-12-04T08:54:02.2498170Z * [new branch] gh/XuehaiPan/397/head -> origin/gh/XuehaiPan/397/head 2025-12-04T08:54:02.2498547Z * [new branch] gh/XuehaiPan/397/orig -> origin/gh/XuehaiPan/397/orig 2025-12-04T08:54:02.2498923Z * [new branch] gh/XuehaiPan/398/base -> origin/gh/XuehaiPan/398/base 
2025-12-04T08:54:02.2499372Z * [new branch] gh/XuehaiPan/398/head -> origin/gh/XuehaiPan/398/head 2025-12-04T08:54:02.2499752Z * [new branch] gh/XuehaiPan/398/orig -> origin/gh/XuehaiPan/398/orig 2025-12-04T08:54:02.2500128Z * [new branch] gh/XuehaiPan/399/base -> origin/gh/XuehaiPan/399/base 2025-12-04T08:54:02.2500506Z * [new branch] gh/XuehaiPan/399/head -> origin/gh/XuehaiPan/399/head 2025-12-04T08:54:02.2500885Z * [new branch] gh/XuehaiPan/399/orig -> origin/gh/XuehaiPan/399/orig 2025-12-04T08:54:02.2501268Z * [new branch] gh/XuehaiPan/400/base -> origin/gh/XuehaiPan/400/base 2025-12-04T08:54:02.2501647Z * [new branch] gh/XuehaiPan/400/head -> origin/gh/XuehaiPan/400/head 2025-12-04T08:54:02.2502026Z * [new branch] gh/XuehaiPan/400/orig -> origin/gh/XuehaiPan/400/orig 2025-12-04T08:54:02.2502417Z * [new branch] gh/ZhiweiYan-96/39/base -> origin/gh/ZhiweiYan-96/39/base 2025-12-04T08:54:02.2502817Z * [new branch] gh/ZhiweiYan-96/39/head -> origin/gh/ZhiweiYan-96/39/head 2025-12-04T08:54:02.2503204Z * [new branch] gh/ZhiweiYan-96/39/orig -> origin/gh/ZhiweiYan-96/39/orig 2025-12-04T08:54:02.2503587Z * [new branch] gh/ZhiweiYan-96/44/base -> origin/gh/ZhiweiYan-96/44/base 2025-12-04T08:54:02.2503972Z * [new branch] gh/ZhiweiYan-96/44/head -> origin/gh/ZhiweiYan-96/44/head 2025-12-04T08:54:02.2504351Z * [new branch] gh/ZhiweiYan-96/45/base -> origin/gh/ZhiweiYan-96/45/base 2025-12-04T08:54:02.2504735Z * [new branch] gh/ZhiweiYan-96/45/head -> origin/gh/ZhiweiYan-96/45/head 2025-12-04T08:54:02.2505115Z * [new branch] gh/ZhiweiYan-96/49/base -> origin/gh/ZhiweiYan-96/49/base 2025-12-04T08:54:02.2505496Z * [new branch] gh/ZhiweiYan-96/49/head -> origin/gh/ZhiweiYan-96/49/head 2025-12-04T08:54:02.2505885Z * [new branch] gh/ZhiweiYan-96/62/base -> origin/gh/ZhiweiYan-96/62/base 2025-12-04T08:54:02.2506614Z * [new branch] gh/ZhiweiYan-96/62/head -> origin/gh/ZhiweiYan-96/62/head 2025-12-04T08:54:02.2507214Z * [new branch] gh/ZhiweiYan-96/66/base -> 
origin/gh/ZhiweiYan-96/66/base 2025-12-04T08:54:02.2507596Z * [new branch] gh/ZhiweiYan-96/66/head -> origin/gh/ZhiweiYan-96/66/head 2025-12-04T08:54:02.2507976Z * [new branch] gh/ZhiweiYan-96/67/base -> origin/gh/ZhiweiYan-96/67/base 2025-12-04T08:54:02.2508352Z * [new branch] gh/ZhiweiYan-96/67/head -> origin/gh/ZhiweiYan-96/67/head 2025-12-04T08:54:02.2508821Z * [new branch] gh/ZhiweiYan-96/68/base -> origin/gh/ZhiweiYan-96/68/base 2025-12-04T08:54:02.2509203Z * [new branch] gh/ZhiweiYan-96/68/head -> origin/gh/ZhiweiYan-96/68/head 2025-12-04T08:54:02.2509582Z * [new branch] gh/ZhiweiYan-96/68/orig -> origin/gh/ZhiweiYan-96/68/orig 2025-12-04T08:54:02.2509977Z * [new branch] gh/aakhundov/1/base -> origin/gh/aakhundov/1/base 2025-12-04T08:54:02.2510353Z * [new branch] gh/aakhundov/1/head -> origin/gh/aakhundov/1/head 2025-12-04T08:54:02.2510722Z * [new branch] gh/aakhundov/2/base -> origin/gh/aakhundov/2/base 2025-12-04T08:54:02.2511091Z * [new branch] gh/aakhundov/2/head -> origin/gh/aakhundov/2/head 2025-12-04T08:54:02.2511469Z * [new branch] gh/aditew01/openblas -> origin/gh/aditew01/openblas 2025-12-04T08:54:02.2511851Z * [new branch] gh/aditew01/sbgemm -> origin/gh/aditew01/sbgemm 2025-12-04T08:54:02.2512227Z * [new branch] gh/aditew01/vecbf16 -> origin/gh/aditew01/vecbf16 2025-12-04T08:54:02.2512591Z * [new branch] gh/albanD/4/base -> origin/gh/albanD/4/base 2025-12-04T08:54:02.2512946Z * [new branch] gh/albanD/4/head -> origin/gh/albanD/4/head 2025-12-04T08:54:02.2513370Z * [new branch] gh/albanD/4/orig -> origin/gh/albanD/4/orig 2025-12-04T08:54:02.2513909Z * [new branch] gh/alexbrauckmann/paddedtensor_faketensor_init -> origin/gh/alexbrauckmann/paddedtensor_faketensor_init 2025-12-04T08:54:02.2514459Z * [new branch] gh/alexsamardzic/12/base -> origin/gh/alexsamardzic/12/base 2025-12-04T08:54:02.2514872Z * [new branch] gh/alexsamardzic/12/head -> origin/gh/alexsamardzic/12/head 2025-12-04T08:54:02.2515272Z * [new branch] gh/alexsamardzic/12/orig -> 
origin/gh/alexsamardzic/12/orig 2025-12-04T08:54:02.2515677Z * [new branch] gh/alexsamardzic/14/base -> origin/gh/alexsamardzic/14/base 2025-12-04T08:54:02.2516121Z * [new branch] gh/alexsamardzic/14/head -> origin/gh/alexsamardzic/14/head 2025-12-04T08:54:02.2516519Z * [new branch] gh/alexsamardzic/14/orig -> origin/gh/alexsamardzic/14/orig 2025-12-04T08:54:02.2516925Z * [new branch] gh/alexsamardzic/15/base -> origin/gh/alexsamardzic/15/base 2025-12-04T08:54:02.2517327Z * [new branch] gh/alexsamardzic/15/head -> origin/gh/alexsamardzic/15/head 2025-12-04T08:54:02.2517725Z * [new branch] gh/alexsamardzic/15/orig -> origin/gh/alexsamardzic/15/orig 2025-12-04T08:54:02.2518114Z * [new branch] gh/amjames/18/base -> origin/gh/amjames/18/base 2025-12-04T08:54:02.2518483Z * [new branch] gh/amjames/18/head -> origin/gh/amjames/18/head 2025-12-04T08:54:02.2518841Z * [new branch] gh/amjames/18/orig -> origin/gh/amjames/18/orig 2025-12-04T08:54:02.2519224Z * [new branch] gh/andrewor14/35/base -> origin/gh/andrewor14/35/base 2025-12-04T08:54:02.2519611Z * [new branch] gh/andrewor14/35/head -> origin/gh/andrewor14/35/head 2025-12-04T08:54:02.2519986Z * [new branch] gh/andrewor14/35/orig -> origin/gh/andrewor14/35/orig 2025-12-04T08:54:02.2520372Z * [new branch] gh/andrewor14/50/base -> origin/gh/andrewor14/50/base 2025-12-04T08:54:02.2520746Z * [new branch] gh/andrewor14/50/head -> origin/gh/andrewor14/50/head 2025-12-04T08:54:02.2521121Z * [new branch] gh/andrewor14/50/orig -> origin/gh/andrewor14/50/orig 2025-12-04T08:54:02.2521497Z * [new branch] gh/andyanwang/30/base -> origin/gh/andyanwang/30/base 2025-12-04T08:54:02.2521875Z * [new branch] gh/andyanwang/30/orig -> origin/gh/andyanwang/30/orig 2025-12-04T08:54:02.2522250Z * [new branch] gh/andyanwang/31/base -> origin/gh/andyanwang/31/base 2025-12-04T08:54:02.2522709Z * [new branch] gh/andyanwang/31/orig -> origin/gh/andyanwang/31/orig 2025-12-04T08:54:02.2523087Z * [new branch] gh/andyanwang/39/base -> 
origin/gh/andyanwang/39/base 2025-12-04T08:54:02.2523463Z * [new branch] gh/andyanwang/39/head -> origin/gh/andyanwang/39/head 2025-12-04T08:54:02.2523844Z * [new branch] gh/andyanwang/39/orig -> origin/gh/andyanwang/39/orig 2025-12-04T08:54:02.2524218Z * [new branch] gh/andyanwang/42/base -> origin/gh/andyanwang/42/base 2025-12-04T08:54:02.2524597Z * [new branch] gh/andyanwang/42/head -> origin/gh/andyanwang/42/head 2025-12-04T08:54:02.2524976Z * [new branch] gh/andyanwang/42/orig -> origin/gh/andyanwang/42/orig 2025-12-04T08:54:02.2525351Z * [new branch] gh/andyanwang/45/base -> origin/gh/andyanwang/45/base 2025-12-04T08:54:02.2525739Z * [new branch] gh/andyanwang/45/head -> origin/gh/andyanwang/45/head 2025-12-04T08:54:02.2526162Z * [new branch] gh/andyanwang/45/orig -> origin/gh/andyanwang/45/orig 2025-12-04T08:54:02.2526541Z * [new branch] gh/angelayi/107/base -> origin/gh/angelayi/107/base 2025-12-04T08:54:02.2526919Z * [new branch] gh/angelayi/107/head -> origin/gh/angelayi/107/head 2025-12-04T08:54:02.2527369Z * [new branch] gh/angelayi/114/base -> origin/gh/angelayi/114/base 2025-12-04T08:54:02.2527737Z * [new branch] gh/angelayi/114/head -> origin/gh/angelayi/114/head 2025-12-04T08:54:02.2528109Z * [new branch] gh/angelayi/114/orig -> origin/gh/angelayi/114/orig 2025-12-04T08:54:02.2528478Z * [new branch] gh/angelayi/116/base -> origin/gh/angelayi/116/base 2025-12-04T08:54:02.2528842Z * [new branch] gh/angelayi/116/head -> origin/gh/angelayi/116/head 2025-12-04T08:54:02.2529214Z * [new branch] gh/angelayi/116/orig -> origin/gh/angelayi/116/orig 2025-12-04T08:54:02.2529585Z * [new branch] gh/angelayi/122/base -> origin/gh/angelayi/122/base 2025-12-04T08:54:02.2529949Z * [new branch] gh/angelayi/122/head -> origin/gh/angelayi/122/head 2025-12-04T08:54:02.2530314Z * [new branch] gh/angelayi/122/orig -> origin/gh/angelayi/122/orig 2025-12-04T08:54:02.2530686Z * [new branch] gh/angelayi/124/base -> origin/gh/angelayi/124/base 2025-12-04T08:54:02.2531051Z * 
[new branch] gh/angelayi/124/head -> origin/gh/angelayi/124/head 2025-12-04T08:54:02.2531419Z * [new branch] gh/angelayi/124/orig -> origin/gh/angelayi/124/orig 2025-12-04T08:54:02.2531787Z * [new branch] gh/angelayi/128/base -> origin/gh/angelayi/128/base 2025-12-04T08:54:02.2532152Z * [new branch] gh/angelayi/128/head -> origin/gh/angelayi/128/head 2025-12-04T08:54:02.2532522Z * [new branch] gh/angelayi/128/orig -> origin/gh/angelayi/128/orig 2025-12-04T08:54:02.2532886Z * [new branch] gh/angelayi/131/base -> origin/gh/angelayi/131/base 2025-12-04T08:54:02.2533252Z * [new branch] gh/angelayi/131/head -> origin/gh/angelayi/131/head 2025-12-04T08:54:02.2533622Z * [new branch] gh/angelayi/131/orig -> origin/gh/angelayi/131/orig 2025-12-04T08:54:02.2533990Z * [new branch] gh/angelayi/132/base -> origin/gh/angelayi/132/base 2025-12-04T08:54:02.2534361Z * [new branch] gh/angelayi/132/head -> origin/gh/angelayi/132/head 2025-12-04T08:54:02.2534729Z * [new branch] gh/angelayi/132/orig -> origin/gh/angelayi/132/orig 2025-12-04T08:54:02.2535095Z * [new branch] gh/angelayi/133/base -> origin/gh/angelayi/133/base 2025-12-04T08:54:02.2535463Z * [new branch] gh/angelayi/133/head -> origin/gh/angelayi/133/head 2025-12-04T08:54:02.2535910Z * [new branch] gh/angelayi/133/orig -> origin/gh/angelayi/133/orig 2025-12-04T08:54:02.2536341Z * [new branch] gh/angelayi/134/base -> origin/gh/angelayi/134/base 2025-12-04T08:54:02.2536710Z * [new branch] gh/angelayi/134/head -> origin/gh/angelayi/134/head 2025-12-04T08:54:02.2537077Z * [new branch] gh/angelayi/134/orig -> origin/gh/angelayi/134/orig 2025-12-04T08:54:02.2537448Z * [new branch] gh/angelayi/135/base -> origin/gh/angelayi/135/base 2025-12-04T08:54:02.2537816Z * [new branch] gh/angelayi/135/head -> origin/gh/angelayi/135/head 2025-12-04T08:54:02.2538185Z * [new branch] gh/angelayi/135/orig -> origin/gh/angelayi/135/orig 2025-12-04T08:54:02.2538554Z * [new branch] gh/angelayi/136/base -> origin/gh/angelayi/136/base 
2025-12-04T08:54:02.2538919Z * [new branch] gh/angelayi/136/head -> origin/gh/angelayi/136/head 2025-12-04T08:54:02.2539295Z * [new branch] gh/angelayi/136/orig -> origin/gh/angelayi/136/orig 2025-12-04T08:54:02.2539659Z * [new branch] gh/angelayi/137/base -> origin/gh/angelayi/137/base 2025-12-04T08:54:02.2540026Z * [new branch] gh/angelayi/137/head -> origin/gh/angelayi/137/head 2025-12-04T08:54:02.2540469Z * [new branch] gh/angelayi/137/orig -> origin/gh/angelayi/137/orig 2025-12-04T08:54:02.2540842Z * [new branch] gh/angelayi/138/base -> origin/gh/angelayi/138/base 2025-12-04T08:54:02.2541209Z * [new branch] gh/angelayi/138/head -> origin/gh/angelayi/138/head 2025-12-04T08:54:02.2541575Z * [new branch] gh/angelayi/138/orig -> origin/gh/angelayi/138/orig 2025-12-04T08:54:02.2541942Z * [new branch] gh/angelayi/139/base -> origin/gh/angelayi/139/base 2025-12-04T08:54:02.2542308Z * [new branch] gh/angelayi/139/head -> origin/gh/angelayi/139/head 2025-12-04T08:54:02.2542677Z * [new branch] gh/angelayi/139/orig -> origin/gh/angelayi/139/orig 2025-12-04T08:54:02.2543052Z * [new branch] gh/angelayi/140/base -> origin/gh/angelayi/140/base 2025-12-04T08:54:02.2543421Z * [new branch] gh/angelayi/140/head -> origin/gh/angelayi/140/head 2025-12-04T08:54:02.2543790Z * [new branch] gh/angelayi/140/orig -> origin/gh/angelayi/140/orig 2025-12-04T08:54:02.2544158Z * [new branch] gh/angelayi/141/base -> origin/gh/angelayi/141/base 2025-12-04T08:54:02.2544524Z * [new branch] gh/angelayi/141/head -> origin/gh/angelayi/141/head 2025-12-04T08:54:02.2544889Z * [new branch] gh/angelayi/141/orig -> origin/gh/angelayi/141/orig 2025-12-04T08:54:02.2545259Z * [new branch] gh/angelayi/142/base -> origin/gh/angelayi/142/base 2025-12-04T08:54:02.2545625Z * [new branch] gh/angelayi/142/head -> origin/gh/angelayi/142/head 2025-12-04T08:54:02.2546046Z * [new branch] gh/angelayi/142/orig -> origin/gh/angelayi/142/orig 2025-12-04T08:54:02.2546415Z * [new branch] gh/angelayi/143/base -> 
origin/gh/angelayi/143/base 2025-12-04T08:54:02.2546783Z * [new branch] gh/angelayi/143/head -> origin/gh/angelayi/143/head 2025-12-04T08:54:02.2547154Z * [new branch] gh/angelayi/143/orig -> origin/gh/angelayi/143/orig 2025-12-04T08:54:02.2547687Z * [new branch] gh/angelayi/144/base -> origin/gh/angelayi/144/base 2025-12-04T08:54:02.2548063Z * [new branch] gh/angelayi/144/head -> origin/gh/angelayi/144/head 2025-12-04T08:54:02.2548428Z * [new branch] gh/angelayi/144/orig -> origin/gh/angelayi/144/orig 2025-12-04T08:54:02.2548818Z * [new branch] gh/anijain2305/753/base -> origin/gh/anijain2305/753/base 2025-12-04T08:54:02.2549209Z * [new branch] gh/anijain2305/753/head -> origin/gh/anijain2305/753/head 2025-12-04T08:54:02.2549672Z * [new branch] gh/anijain2305/753/orig -> origin/gh/anijain2305/753/orig 2025-12-04T08:54:02.2550053Z * [new branch] gh/anijain2305/810/base -> origin/gh/anijain2305/810/base 2025-12-04T08:54:02.2550434Z * [new branch] gh/anijain2305/810/head -> origin/gh/anijain2305/810/head 2025-12-04T08:54:02.2550812Z * [new branch] gh/anijain2305/810/orig -> origin/gh/anijain2305/810/orig 2025-12-04T08:54:02.2551190Z * [new branch] gh/anijain2305/854/base -> origin/gh/anijain2305/854/base 2025-12-04T08:54:02.2551569Z * [new branch] gh/anijain2305/854/head -> origin/gh/anijain2305/854/head 2025-12-04T08:54:02.2551946Z * [new branch] gh/anijain2305/854/orig -> origin/gh/anijain2305/854/orig 2025-12-04T08:54:02.2552327Z * [new branch] gh/anijain2305/864/base -> origin/gh/anijain2305/864/base 2025-12-04T08:54:02.2552707Z * [new branch] gh/anijain2305/864/head -> origin/gh/anijain2305/864/head 2025-12-04T08:54:02.2553087Z * [new branch] gh/anijain2305/864/orig -> origin/gh/anijain2305/864/orig 2025-12-04T08:54:02.2553466Z * [new branch] gh/anijain2305/870/base -> origin/gh/anijain2305/870/base 2025-12-04T08:54:02.2553844Z * [new branch] gh/anijain2305/870/head -> origin/gh/anijain2305/870/head 2025-12-04T08:54:02.2554314Z * [new branch] 
gh/anijain2305/870/orig -> origin/gh/anijain2305/870/orig 2025-12-04T08:54:02.2554694Z * [new branch] gh/anijain2305/873/base -> origin/gh/anijain2305/873/base 2025-12-04T08:54:02.2555069Z * [new branch] gh/anijain2305/873/head -> origin/gh/anijain2305/873/head 2025-12-04T08:54:02.2555448Z * [new branch] gh/anijain2305/873/orig -> origin/gh/anijain2305/873/orig 2025-12-04T08:54:02.2555829Z * [new branch] gh/anijain2305/894/base -> origin/gh/anijain2305/894/base 2025-12-04T08:54:02.2556249Z * [new branch] gh/anijain2305/894/head -> origin/gh/anijain2305/894/head 2025-12-04T08:54:02.2556630Z * [new branch] gh/anijain2305/894/orig -> origin/gh/anijain2305/894/orig 2025-12-04T08:54:02.2557010Z * [new branch] gh/anijain2305/895/base -> origin/gh/anijain2305/895/base 2025-12-04T08:54:02.2557393Z * [new branch] gh/anijain2305/895/head -> origin/gh/anijain2305/895/head 2025-12-04T08:54:02.2557773Z * [new branch] gh/anijain2305/895/orig -> origin/gh/anijain2305/895/orig 2025-12-04T08:54:02.2558151Z * [new branch] gh/anijain2305/910/base -> origin/gh/anijain2305/910/base 2025-12-04T08:54:02.2558527Z * [new branch] gh/anijain2305/910/head -> origin/gh/anijain2305/910/head 2025-12-04T08:54:02.2558904Z * [new branch] gh/anijain2305/910/orig -> origin/gh/anijain2305/910/orig 2025-12-04T08:54:02.2559288Z * [new branch] gh/anijain2305/919/base -> origin/gh/anijain2305/919/base 2025-12-04T08:54:02.2559671Z * [new branch] gh/anijain2305/919/head -> origin/gh/anijain2305/919/head 2025-12-04T08:54:02.2560050Z * [new branch] gh/anijain2305/919/orig -> origin/gh/anijain2305/919/orig 2025-12-04T08:54:02.2560429Z * [new branch] gh/anijain2305/922/base -> origin/gh/anijain2305/922/base 2025-12-04T08:54:02.2560808Z * [new branch] gh/anijain2305/922/head -> origin/gh/anijain2305/922/head 2025-12-04T08:54:02.2561189Z * [new branch] gh/anijain2305/922/orig -> origin/gh/anijain2305/922/orig 2025-12-04T08:54:02.2561567Z * [new branch] gh/anijain2305/932/base -> origin/gh/anijain2305/932/base 
2025-12-04T08:54:02.2561947Z * [new branch] gh/anijain2305/932/head -> origin/gh/anijain2305/932/head 2025-12-04T08:54:02.2562329Z * [new branch] gh/anijain2305/932/orig -> origin/gh/anijain2305/932/orig 2025-12-04T08:54:02.2562781Z * [new branch] gh/anijain2305/940/base -> origin/gh/anijain2305/940/base 2025-12-04T08:54:02.2563166Z * [new branch] gh/anijain2305/940/head -> origin/gh/anijain2305/940/head 2025-12-04T08:54:02.2563547Z * [new branch] gh/anijain2305/940/orig -> origin/gh/anijain2305/940/orig 2025-12-04T08:54:02.2563926Z * [new branch] gh/anijain2305/941/base -> origin/gh/anijain2305/941/base 2025-12-04T08:54:02.2564311Z * [new branch] gh/anijain2305/941/head -> origin/gh/anijain2305/941/head 2025-12-04T08:54:02.2564696Z * [new branch] gh/anijain2305/941/orig -> origin/gh/anijain2305/941/orig 2025-12-04T08:54:02.2565075Z * [new branch] gh/anijain2305/942/base -> origin/gh/anijain2305/942/base 2025-12-04T08:54:02.2565460Z * [new branch] gh/anijain2305/942/head -> origin/gh/anijain2305/942/head 2025-12-04T08:54:02.2565841Z * [new branch] gh/anijain2305/942/orig -> origin/gh/anijain2305/942/orig 2025-12-04T08:54:02.2566269Z * [new branch] gh/anijain2305/943/base -> origin/gh/anijain2305/943/base 2025-12-04T08:54:02.2566652Z * [new branch] gh/anijain2305/943/head -> origin/gh/anijain2305/943/head 2025-12-04T08:54:02.2567033Z * [new branch] gh/anijain2305/943/orig -> origin/gh/anijain2305/943/orig 2025-12-04T08:54:02.2567485Z * [new branch] gh/anijain2305/944/base -> origin/gh/anijain2305/944/base 2025-12-04T08:54:02.2567871Z * [new branch] gh/anijain2305/944/head -> origin/gh/anijain2305/944/head 2025-12-04T08:54:02.2568252Z * [new branch] gh/anijain2305/944/orig -> origin/gh/anijain2305/944/orig 2025-12-04T08:54:02.2568628Z * [new branch] gh/anijain2305/945/base -> origin/gh/anijain2305/945/base 2025-12-04T08:54:02.2569009Z * [new branch] gh/anijain2305/945/head -> origin/gh/anijain2305/945/head 2025-12-04T08:54:02.2569393Z * [new branch] 
gh/anijain2305/945/orig -> origin/gh/anijain2305/945/orig 2025-12-04T08:54:02.2569774Z * [new branch] gh/anijain2305/946/base -> origin/gh/anijain2305/946/base 2025-12-04T08:54:02.2570155Z * [new branch] gh/anijain2305/946/head -> origin/gh/anijain2305/946/head 2025-12-04T08:54:02.2570532Z * [new branch] gh/anijain2305/946/orig -> origin/gh/anijain2305/946/orig 2025-12-04T08:54:02.2570922Z * [new branch] gh/anijain2305/947/base -> origin/gh/anijain2305/947/base 2025-12-04T08:54:02.2571303Z * [new branch] gh/anijain2305/947/head -> origin/gh/anijain2305/947/head 2025-12-04T08:54:02.2571681Z * [new branch] gh/anijain2305/947/orig -> origin/gh/anijain2305/947/orig 2025-12-04T08:54:02.2572063Z * [new branch] gh/anijain2305/948/base -> origin/gh/anijain2305/948/base 2025-12-04T08:54:02.2572448Z * [new branch] gh/anijain2305/948/head -> origin/gh/anijain2305/948/head 2025-12-04T08:54:02.2572833Z * [new branch] gh/anijain2305/948/orig -> origin/gh/anijain2305/948/orig 2025-12-04T08:54:02.2573217Z * [new branch] gh/anijain2305/949/base -> origin/gh/anijain2305/949/base 2025-12-04T08:54:02.2573595Z * [new branch] gh/anijain2305/949/head -> origin/gh/anijain2305/949/head 2025-12-04T08:54:02.2573974Z * [new branch] gh/anijain2305/949/orig -> origin/gh/anijain2305/949/orig 2025-12-04T08:54:02.2574364Z * [new branch] gh/anijain2305/950/base -> origin/gh/anijain2305/950/base 2025-12-04T08:54:02.2574746Z * [new branch] gh/anijain2305/950/head -> origin/gh/anijain2305/950/head 2025-12-04T08:54:02.2575124Z * [new branch] gh/anijain2305/950/orig -> origin/gh/anijain2305/950/orig 2025-12-04T08:54:02.2575516Z * [new branch] gh/anijain2305/951/base -> origin/gh/anijain2305/951/base 2025-12-04T08:54:02.2575897Z * [new branch] gh/anijain2305/951/head -> origin/gh/anijain2305/951/head 2025-12-04T08:54:02.2576402Z * [new branch] gh/anijain2305/951/orig -> origin/gh/anijain2305/951/orig 2025-12-04T08:54:02.2576786Z * [new branch] gh/anijain2305/952/base -> origin/gh/anijain2305/952/base 
2025-12-04T08:54:02.2577167Z * [new branch] gh/anijain2305/952/head -> origin/gh/anijain2305/952/head 2025-12-04T08:54:02.2577547Z * [new branch] gh/anijain2305/952/orig -> origin/gh/anijain2305/952/orig 2025-12-04T08:54:02.2577929Z * [new branch] gh/anijain2305/953/base -> origin/gh/anijain2305/953/base 2025-12-04T08:54:02.2578315Z * [new branch] gh/anijain2305/953/head -> origin/gh/anijain2305/953/head 2025-12-04T08:54:02.2578693Z * [new branch] gh/anijain2305/953/orig -> origin/gh/anijain2305/953/orig 2025-12-04T08:54:02.2579075Z * [new branch] gh/anijain2305/954/base -> origin/gh/anijain2305/954/base 2025-12-04T08:54:02.2579460Z * [new branch] gh/anijain2305/954/head -> origin/gh/anijain2305/954/head 2025-12-04T08:54:02.2579844Z * [new branch] gh/anijain2305/954/orig -> origin/gh/anijain2305/954/orig 2025-12-04T08:54:02.2580225Z * [new branch] gh/anijain2305/955/base -> origin/gh/anijain2305/955/base 2025-12-04T08:54:02.2580606Z * [new branch] gh/anijain2305/955/head -> origin/gh/anijain2305/955/head 2025-12-04T08:54:02.2581066Z * [new branch] gh/anijain2305/955/orig -> origin/gh/anijain2305/955/orig 2025-12-04T08:54:02.2581453Z * [new branch] gh/anijain2305/956/base -> origin/gh/anijain2305/956/base 2025-12-04T08:54:02.2581833Z * [new branch] gh/anijain2305/956/head -> origin/gh/anijain2305/956/head 2025-12-04T08:54:02.2582214Z * [new branch] gh/anijain2305/956/orig -> origin/gh/anijain2305/956/orig 2025-12-04T08:54:02.2582597Z * [new branch] gh/anijain2305/957/base -> origin/gh/anijain2305/957/base 2025-12-04T08:54:02.2582980Z * [new branch] gh/anijain2305/957/head -> origin/gh/anijain2305/957/head 2025-12-04T08:54:02.2583364Z * [new branch] gh/anijain2305/957/orig -> origin/gh/anijain2305/957/orig 2025-12-04T08:54:02.2583749Z * [new branch] gh/anijain2305/958/base -> origin/gh/anijain2305/958/base 2025-12-04T08:54:02.2584129Z * [new branch] gh/anijain2305/958/head -> origin/gh/anijain2305/958/head 2025-12-04T08:54:02.2584513Z * [new branch] 
gh/anijain2305/958/orig -> origin/gh/anijain2305/958/orig 2025-12-04T08:54:02.2584893Z * [new branch] gh/anijain2305/959/base -> origin/gh/anijain2305/959/base 2025-12-04T08:54:02.2585274Z * [new branch] gh/anijain2305/959/head -> origin/gh/anijain2305/959/head 2025-12-04T08:54:02.2585656Z * [new branch] gh/anijain2305/959/orig -> origin/gh/anijain2305/959/orig 2025-12-04T08:54:02.2586100Z * [new branch] gh/anijain2305/960/base -> origin/gh/anijain2305/960/base 2025-12-04T08:54:02.2586482Z * [new branch] gh/anijain2305/960/head -> origin/gh/anijain2305/960/head 2025-12-04T08:54:02.2586866Z * [new branch] gh/anijain2305/960/orig -> origin/gh/anijain2305/960/orig 2025-12-04T08:54:02.2587247Z * [new branch] gh/anijain2305/961/base -> origin/gh/anijain2305/961/base 2025-12-04T08:54:02.2587634Z * [new branch] gh/anijain2305/961/head -> origin/gh/anijain2305/961/head 2025-12-04T08:54:02.2588015Z * [new branch] gh/anijain2305/961/orig -> origin/gh/anijain2305/961/orig 2025-12-04T08:54:02.2588395Z * [new branch] gh/anijain2305/962/base -> origin/gh/anijain2305/962/base 2025-12-04T08:54:02.2588771Z * [new branch] gh/anijain2305/962/head -> origin/gh/anijain2305/962/head 2025-12-04T08:54:02.2589154Z * [new branch] gh/anijain2305/962/orig -> origin/gh/anijain2305/962/orig 2025-12-04T08:54:02.2589534Z * [new branch] gh/anijain2305/963/base -> origin/gh/anijain2305/963/base 2025-12-04T08:54:02.2590231Z * [new branch] gh/anijain2305/963/head -> origin/gh/anijain2305/963/head 2025-12-04T08:54:02.2590613Z * [new branch] gh/anijain2305/963/orig -> origin/gh/anijain2305/963/orig 2025-12-04T08:54:02.2590991Z * [new branch] gh/anijain2305/964/base -> origin/gh/anijain2305/964/base 2025-12-04T08:54:02.2591378Z * [new branch] gh/anijain2305/964/head -> origin/gh/anijain2305/964/head 2025-12-04T08:54:02.2591758Z * [new branch] gh/anijain2305/964/orig -> origin/gh/anijain2305/964/orig 2025-12-04T08:54:02.2592142Z * [new branch] gh/anijain2305/965/base -> origin/gh/anijain2305/965/base 
2025-12-04T08:54:02.2592524Z * [new branch] gh/anijain2305/965/head -> origin/gh/anijain2305/965/head 2025-12-04T08:54:02.2592909Z * [new branch] gh/anijain2305/965/orig -> origin/gh/anijain2305/965/orig 2025-12-04T08:54:02.2593287Z * [new branch] gh/anijain2305/966/base -> origin/gh/anijain2305/966/base 2025-12-04T08:54:02.2593679Z * [new branch] gh/anijain2305/966/head -> origin/gh/anijain2305/966/head 2025-12-04T08:54:02.2594059Z * [new branch] gh/anijain2305/966/orig -> origin/gh/anijain2305/966/orig 2025-12-04T08:54:02.2594363Z * [new branch] gh/anijain2305/967/base -> origin/gh/anijain2305/967/base 2025-12-04T08:54:02.2594605Z * [new branch] gh/anijain2305/967/head -> origin/gh/anijain2305/967/head 2025-12-04T08:54:02.2594795Z * [new branch] gh/anijain2305/967/orig -> origin/gh/anijain2305/967/orig 2025-12-04T08:54:02.2594979Z * [new branch] gh/anijain2305/968/base -> origin/gh/anijain2305/968/base 2025-12-04T08:54:02.2595169Z * [new branch] gh/anijain2305/968/head -> origin/gh/anijain2305/968/head 2025-12-04T08:54:02.2595356Z * [new branch] gh/anijain2305/968/orig -> origin/gh/anijain2305/968/orig 2025-12-04T08:54:02.2595547Z * [new branch] gh/anijain2305/969/base -> origin/gh/anijain2305/969/base 2025-12-04T08:54:02.2595735Z * [new branch] gh/anijain2305/969/head -> origin/gh/anijain2305/969/head 2025-12-04T08:54:02.2595960Z * [new branch] gh/anijain2305/969/orig -> origin/gh/anijain2305/969/orig 2025-12-04T08:54:02.2596147Z * [new branch] gh/anijain2305/970/base -> origin/gh/anijain2305/970/base 2025-12-04T08:54:02.2596334Z * [new branch] gh/anijain2305/970/head -> origin/gh/anijain2305/970/head 2025-12-04T08:54:02.2596521Z * [new branch] gh/anijain2305/970/orig -> origin/gh/anijain2305/970/orig 2025-12-04T08:54:02.2596709Z * [new branch] gh/anjali411/216/base -> origin/gh/anjali411/216/base 2025-12-04T08:54:02.2596895Z * [new branch] gh/anjali411/216/head -> origin/gh/anjali411/216/head 2025-12-04T08:54:02.2597078Z * [new branch] gh/anjali411/216/orig -> 
origin/gh/anjali411/216/orig 2025-12-04T08:54:02.2597263Z * [new branch] gh/anshul-si/1/base -> origin/gh/anshul-si/1/base 2025-12-04T08:54:02.2597448Z * [new branch] gh/anshul-si/1/head -> origin/gh/anshul-si/1/head 2025-12-04T08:54:02.2597626Z * [new branch] gh/anshul-si/2/base -> origin/gh/anshul-si/2/base 2025-12-04T08:54:02.2597807Z * [new branch] gh/anshul-si/2/head -> origin/gh/anshul-si/2/head 2025-12-04T08:54:02.2597986Z * [new branch] gh/anshul-si/3/base -> origin/gh/anshul-si/3/base 2025-12-04T08:54:02.2598164Z * [new branch] gh/anshul-si/3/head -> origin/gh/anshul-si/3/head 2025-12-04T08:54:02.2598344Z * [new branch] gh/anshul-si/4/base -> origin/gh/anshul-si/4/base 2025-12-04T08:54:02.2598522Z * [new branch] gh/anshul-si/4/head -> origin/gh/anshul-si/4/head 2025-12-04T08:54:02.2598699Z * [new branch] gh/anshul-si/5/base -> origin/gh/anshul-si/5/base 2025-12-04T08:54:02.2598937Z * [new branch] gh/anshul-si/5/head -> origin/gh/anshul-si/5/head 2025-12-04T08:54:02.2599121Z * [new branch] gh/anshul-si/53/base -> origin/gh/anshul-si/53/base 2025-12-04T08:54:02.2599301Z * [new branch] gh/anshul-si/53/head -> origin/gh/anshul-si/53/head 2025-12-04T08:54:02.2599486Z * [new branch] gh/anshul-si/58/base -> origin/gh/anshul-si/58/base 2025-12-04T08:54:02.2599670Z * [new branch] gh/anshul-si/58/head -> origin/gh/anshul-si/58/head 2025-12-04T08:54:02.2599851Z * [new branch] gh/anshul-si/66/base -> origin/gh/anshul-si/66/base 2025-12-04T08:54:02.2600031Z * [new branch] gh/anshul-si/66/head -> origin/gh/anshul-si/66/head 2025-12-04T08:54:02.2600207Z * [new branch] gh/anshul-si/66/orig -> origin/gh/anshul-si/66/orig 2025-12-04T08:54:02.2600380Z * [new branch] gh/anshul-si/67/base -> origin/gh/anshul-si/67/base 2025-12-04T08:54:02.2600563Z * [new branch] gh/anshul-si/67/head -> origin/gh/anshul-si/67/head 2025-12-04T08:54:02.2600739Z * [new branch] gh/anshul-si/67/orig -> origin/gh/anshul-si/67/orig 2025-12-04T08:54:02.2600913Z * [new branch] gh/anshul-si/68/base -> 
origin/gh/anshul-si/68/base 2025-12-04T08:54:02.2601148Z * [new branch] gh/anshul-si/68/head -> origin/gh/anshul-si/68/head 2025-12-04T08:54:02.2601330Z * [new branch] gh/anshul-si/68/orig -> origin/gh/anshul-si/68/orig 2025-12-04T08:54:02.2601506Z * [new branch] gh/anshul-si/69/base -> origin/gh/anshul-si/69/base 2025-12-04T08:54:02.2601682Z * [new branch] gh/anshul-si/69/head -> origin/gh/anshul-si/69/head 2025-12-04T08:54:02.2601858Z * [new branch] gh/anshul-si/69/orig -> origin/gh/anshul-si/69/orig 2025-12-04T08:54:02.2602032Z * [new branch] gh/anshul-si/70/base -> origin/gh/anshul-si/70/base 2025-12-04T08:54:02.2602218Z * [new branch] gh/anshul-si/70/head -> origin/gh/anshul-si/70/head 2025-12-04T08:54:02.2602393Z * [new branch] gh/anshul-si/70/orig -> origin/gh/anshul-si/70/orig 2025-12-04T08:54:02.2602568Z * [new branch] gh/anshul-si/71/base -> origin/gh/anshul-si/71/base 2025-12-04T08:54:02.2602750Z * [new branch] gh/anshul-si/71/head -> origin/gh/anshul-si/71/head 2025-12-04T08:54:02.2602931Z * [new branch] gh/anshul-si/71/orig -> origin/gh/anshul-si/71/orig 2025-12-04T08:54:02.2603105Z * [new branch] gh/anshul-si/72/base -> origin/gh/anshul-si/72/base 2025-12-04T08:54:02.2603281Z * [new branch] gh/anshul-si/72/head -> origin/gh/anshul-si/72/head 2025-12-04T08:54:02.2603456Z * [new branch] gh/anshul-si/72/orig -> origin/gh/anshul-si/72/orig 2025-12-04T08:54:02.2603633Z * [new branch] gh/anshul-si/73/base -> origin/gh/anshul-si/73/base 2025-12-04T08:54:02.2603816Z * [new branch] gh/anshul-si/73/head -> origin/gh/anshul-si/73/head 2025-12-04T08:54:02.2603989Z * [new branch] gh/anshul-si/73/orig -> origin/gh/anshul-si/73/orig 2025-12-04T08:54:02.2604168Z * [new branch] gh/aorenste/132/base -> origin/gh/aorenste/132/base 2025-12-04T08:54:02.2604351Z * [new branch] gh/aorenste/132/head -> origin/gh/aorenste/132/head 2025-12-04T08:54:02.2604529Z * [new branch] gh/aorenste/134/base -> origin/gh/aorenste/134/base 2025-12-04T08:54:02.2604708Z * [new branch] 
gh/aorenste/134/head -> origin/gh/aorenste/134/head 2025-12-04T08:54:02.2604887Z * [new branch] gh/aorenste/134/orig -> origin/gh/aorenste/134/orig 2025-12-04T08:54:02.2605064Z * [new branch] gh/aorenste/139/base -> origin/gh/aorenste/139/base 2025-12-04T08:54:02.2605245Z * [new branch] gh/aorenste/139/head -> origin/gh/aorenste/139/head 2025-12-04T08:54:02.2605461Z * [new branch] gh/aorenste/139/orig -> origin/gh/aorenste/139/orig 2025-12-04T08:54:02.2605637Z * [new branch] gh/aorenste/141/base -> origin/gh/aorenste/141/base 2025-12-04T08:54:02.2605820Z * [new branch] gh/aorenste/141/head -> origin/gh/aorenste/141/head 2025-12-04T08:54:02.2606044Z * [new branch] gh/aorenste/145/base -> origin/gh/aorenste/145/base 2025-12-04T08:54:02.2606222Z * [new branch] gh/aorenste/145/head -> origin/gh/aorenste/145/head 2025-12-04T08:54:02.2606397Z * [new branch] gh/aorenste/145/orig -> origin/gh/aorenste/145/orig 2025-12-04T08:54:02.2606580Z * [new branch] gh/aorenste/146/base -> origin/gh/aorenste/146/base 2025-12-04T08:54:02.2606755Z * [new branch] gh/aorenste/146/head -> origin/gh/aorenste/146/head 2025-12-04T08:54:02.2606932Z * [new branch] gh/aorenste/146/orig -> origin/gh/aorenste/146/orig 2025-12-04T08:54:02.2607112Z * [new branch] gh/aorenste/147/base -> origin/gh/aorenste/147/base 2025-12-04T08:54:02.2607290Z * [new branch] gh/aorenste/147/head -> origin/gh/aorenste/147/head 2025-12-04T08:54:02.2607468Z * [new branch] gh/aorenste/147/orig -> origin/gh/aorenste/147/orig 2025-12-04T08:54:02.2607690Z * [new branch] gh/aorenste/148/base -> origin/gh/aorenste/148/base 2025-12-04T08:54:02.2607867Z * [new branch] gh/aorenste/148/head -> origin/gh/aorenste/148/head 2025-12-04T08:54:02.2608050Z * [new branch] gh/aorenste/148/orig -> origin/gh/aorenste/148/orig 2025-12-04T08:54:02.2608227Z * [new branch] gh/aorenste/149/base -> origin/gh/aorenste/149/base 2025-12-04T08:54:02.2608405Z * [new branch] gh/aorenste/149/head -> origin/gh/aorenste/149/head 
2025-12-04T08:54:02.2608588Z * [new branch] gh/aorenste/149/orig -> origin/gh/aorenste/149/orig 2025-12-04T08:54:02.2608767Z * [new branch] gh/aorenste/150/base -> origin/gh/aorenste/150/base 2025-12-04T08:54:02.2608948Z * [new branch] gh/aorenste/150/head -> origin/gh/aorenste/150/head 2025-12-04T08:54:02.2609126Z * [new branch] gh/aorenste/150/orig -> origin/gh/aorenste/150/orig 2025-12-04T08:54:02.2609310Z * [new branch] gh/aorenste/151/base -> origin/gh/aorenste/151/base 2025-12-04T08:54:02.2609490Z * [new branch] gh/aorenste/151/head -> origin/gh/aorenste/151/head 2025-12-04T08:54:02.2609667Z * [new branch] gh/aorenste/151/orig -> origin/gh/aorenste/151/orig 2025-12-04T08:54:02.2609843Z * [new branch] gh/aorenste/152/base -> origin/gh/aorenste/152/base 2025-12-04T08:54:02.2610026Z * [new branch] gh/aorenste/152/head -> origin/gh/aorenste/152/head 2025-12-04T08:54:02.2610205Z * [new branch] gh/aorenste/152/orig -> origin/gh/aorenste/152/orig 2025-12-04T08:54:02.2610381Z * [new branch] gh/aorenste/153/base -> origin/gh/aorenste/153/base 2025-12-04T08:54:02.2610563Z * [new branch] gh/aorenste/153/head -> origin/gh/aorenste/153/head 2025-12-04T08:54:02.2610742Z * [new branch] gh/aorenste/153/orig -> origin/gh/aorenste/153/orig 2025-12-04T08:54:02.2610921Z * [new branch] gh/aorenste/154/base -> origin/gh/aorenste/154/base 2025-12-04T08:54:02.2611098Z * [new branch] gh/aorenste/154/head -> origin/gh/aorenste/154/head 2025-12-04T08:54:02.2611282Z * [new branch] gh/aorenste/154/orig -> origin/gh/aorenste/154/orig 2025-12-04T08:54:02.2611457Z * [new branch] gh/aorenste/155/base -> origin/gh/aorenste/155/base 2025-12-04T08:54:02.2611636Z * [new branch] gh/aorenste/155/head -> origin/gh/aorenste/155/head 2025-12-04T08:54:02.2611870Z * [new branch] gh/aorenste/155/orig -> origin/gh/aorenste/155/orig 2025-12-04T08:54:02.2612047Z * [new branch] gh/aorenste/156/base -> origin/gh/aorenste/156/base 2025-12-04T08:54:02.2612227Z * [new branch] gh/aorenste/156/head -> 
origin/gh/aorenste/156/head 2025-12-04T08:54:02.2612402Z * [new branch] gh/aorenste/156/orig -> origin/gh/aorenste/156/orig 2025-12-04T08:54:02.2612581Z * [new branch] gh/aorenste/157/base -> origin/gh/aorenste/157/base 2025-12-04T08:54:02.2612762Z * [new branch] gh/aorenste/157/head -> origin/gh/aorenste/157/head 2025-12-04T08:54:02.2612941Z * [new branch] gh/aorenste/157/orig -> origin/gh/aorenste/157/orig 2025-12-04T08:54:02.2613118Z * [new branch] gh/aorenste/158/base -> origin/gh/aorenste/158/base 2025-12-04T08:54:02.2613296Z * [new branch] gh/aorenste/158/head -> origin/gh/aorenste/158/head 2025-12-04T08:54:02.2613482Z * [new branch] gh/aorenste/158/orig -> origin/gh/aorenste/158/orig 2025-12-04T08:54:02.2613662Z * [new branch] gh/aorenste/159/base -> origin/gh/aorenste/159/base 2025-12-04T08:54:02.2613841Z * [new branch] gh/aorenste/159/head -> origin/gh/aorenste/159/head 2025-12-04T08:54:02.2614067Z * [new branch] gh/aorenste/159/orig -> origin/gh/aorenste/159/orig 2025-12-04T08:54:02.2614260Z * [new branch] gh/avikchaudhuri/1/base -> origin/gh/avikchaudhuri/1/base 2025-12-04T08:54:02.2614453Z * [new branch] gh/avikchaudhuri/1/head -> origin/gh/avikchaudhuri/1/head 2025-12-04T08:54:02.2614640Z * [new branch] gh/avikchaudhuri/2/base -> origin/gh/avikchaudhuri/2/base 2025-12-04T08:54:02.2614833Z * [new branch] gh/avikchaudhuri/2/head -> origin/gh/avikchaudhuri/2/head 2025-12-04T08:54:02.2615024Z * [new branch] gh/avikchaudhuri/2/orig -> origin/gh/avikchaudhuri/2/orig 2025-12-04T08:54:02.2615208Z * [new branch] gh/bdhirsh/666/base -> origin/gh/bdhirsh/666/base 2025-12-04T08:54:02.2615385Z * [new branch] gh/bdhirsh/666/head -> origin/gh/bdhirsh/666/head 2025-12-04T08:54:02.2615564Z * [new branch] gh/bdhirsh/666/orig -> origin/gh/bdhirsh/666/orig 2025-12-04T08:54:02.2615743Z * [new branch] gh/bdhirsh/668/base -> origin/gh/bdhirsh/668/base 2025-12-04T08:54:02.2615918Z * [new branch] gh/bdhirsh/668/head -> origin/gh/bdhirsh/668/head 2025-12-04T08:54:02.2616137Z * 
[new branch] gh/bdhirsh/668/orig -> origin/gh/bdhirsh/668/orig 2025-12-04T08:54:02.2616313Z * [new branch] gh/bdhirsh/669/base -> origin/gh/bdhirsh/669/base 2025-12-04T08:54:02.2616490Z * [new branch] gh/bdhirsh/669/head -> origin/gh/bdhirsh/669/head 2025-12-04T08:54:02.2616661Z * [new branch] gh/bdhirsh/669/orig -> origin/gh/bdhirsh/669/orig 2025-12-04T08:54:02.2616844Z * [new branch] gh/bdhirsh/670/base -> origin/gh/bdhirsh/670/base 2025-12-04T08:54:02.2617019Z * [new branch] gh/bdhirsh/670/head -> origin/gh/bdhirsh/670/head 2025-12-04T08:54:02.2617193Z * [new branch] gh/bdhirsh/670/orig -> origin/gh/bdhirsh/670/orig 2025-12-04T08:54:02.2617369Z * [new branch] gh/bdhirsh/672/base -> origin/gh/bdhirsh/672/base 2025-12-04T08:54:02.2617548Z * [new branch] gh/bdhirsh/672/head -> origin/gh/bdhirsh/672/head 2025-12-04T08:54:02.2617720Z * [new branch] gh/bdhirsh/672/orig -> origin/gh/bdhirsh/672/orig 2025-12-04T08:54:02.2617894Z * [new branch] gh/bdhirsh/675/base -> origin/gh/bdhirsh/675/base 2025-12-04T08:54:02.2618067Z * [new branch] gh/bdhirsh/675/head -> origin/gh/bdhirsh/675/head 2025-12-04T08:54:02.2618245Z * [new branch] gh/bdhirsh/675/orig -> origin/gh/bdhirsh/675/orig 2025-12-04T08:54:02.2618465Z * [new branch] gh/bdhirsh/676/base -> origin/gh/bdhirsh/676/base 2025-12-04T08:54:02.2618638Z * [new branch] gh/bdhirsh/676/head -> origin/gh/bdhirsh/676/head 2025-12-04T08:54:02.2618811Z * [new branch] gh/bdhirsh/676/orig -> origin/gh/bdhirsh/676/orig 2025-12-04T08:54:02.2618990Z * [new branch] gh/bdhirsh/677/base -> origin/gh/bdhirsh/677/base 2025-12-04T08:54:02.2619163Z * [new branch] gh/bdhirsh/677/head -> origin/gh/bdhirsh/677/head 2025-12-04T08:54:02.2619232Z * [new branch] gh/bdhirsh/677/orig -> origin/gh/bdhirsh/677/orig 2025-12-04T08:54:02.2619300Z * [new branch] gh/bdhirsh/678/base -> origin/gh/bdhirsh/678/base 2025-12-04T08:54:02.2619371Z * [new branch] gh/bdhirsh/678/head -> origin/gh/bdhirsh/678/head 2025-12-04T08:54:02.2619442Z * [new branch] 
gh/bdhirsh/678/orig -> origin/gh/bdhirsh/678/orig 2025-12-04T08:54:02.2619513Z * [new branch] gh/bdhirsh/679/base -> origin/gh/bdhirsh/679/base 2025-12-04T08:54:02.2619582Z * [new branch] gh/bdhirsh/679/head -> origin/gh/bdhirsh/679/head 2025-12-04T08:54:02.2619650Z * [new branch] gh/bdhirsh/679/orig -> origin/gh/bdhirsh/679/orig 2025-12-04T08:54:02.2619772Z * [new branch] gh/bdhirsh/680/base -> origin/gh/bdhirsh/680/base 2025-12-04T08:54:02.2619843Z * [new branch] gh/bdhirsh/680/head -> origin/gh/bdhirsh/680/head 2025-12-04T08:54:02.2619911Z * [new branch] gh/bdhirsh/680/orig -> origin/gh/bdhirsh/680/orig 2025-12-04T08:54:02.2619981Z * [new branch] gh/bdhirsh/681/base -> origin/gh/bdhirsh/681/base 2025-12-04T08:54:02.2620052Z * [new branch] gh/bdhirsh/681/head -> origin/gh/bdhirsh/681/head 2025-12-04T08:54:02.2620119Z * [new branch] gh/bdhirsh/681/orig -> origin/gh/bdhirsh/681/orig 2025-12-04T08:54:02.2620213Z * [new branch] gh/benjaminglass1/101/base -> origin/gh/benjaminglass1/101/base 2025-12-04T08:54:02.2620304Z * [new branch] gh/benjaminglass1/101/head -> origin/gh/benjaminglass1/101/head 2025-12-04T08:54:02.2620389Z * [new branch] gh/benjaminglass1/101/orig -> origin/gh/benjaminglass1/101/orig 2025-12-04T08:54:02.2620476Z * [new branch] gh/benjaminglass1/102/base -> origin/gh/benjaminglass1/102/base 2025-12-04T08:54:02.2620566Z * [new branch] gh/benjaminglass1/102/head -> origin/gh/benjaminglass1/102/head 2025-12-04T08:54:02.2620650Z * [new branch] gh/benjaminglass1/102/orig -> origin/gh/benjaminglass1/102/orig 2025-12-04T08:54:02.2620736Z * [new branch] gh/benjaminglass1/106/base -> origin/gh/benjaminglass1/106/base 2025-12-04T08:54:02.2620819Z * [new branch] gh/benjaminglass1/106/head -> origin/gh/benjaminglass1/106/head 2025-12-04T08:54:02.2620904Z * [new branch] gh/benjaminglass1/106/orig -> origin/gh/benjaminglass1/106/orig 2025-12-04T08:54:02.2620990Z * [new branch] gh/benjaminglass1/107/base -> origin/gh/benjaminglass1/107/base 
2025-12-04T08:54:02.2621074Z * [new branch] gh/benjaminglass1/107/head -> origin/gh/benjaminglass1/107/head 2025-12-04T08:54:02.2621162Z * [new branch] gh/benjaminglass1/107/orig -> origin/gh/benjaminglass1/107/orig 2025-12-04T08:54:02.2621250Z * [new branch] gh/benjaminglass1/108/base -> origin/gh/benjaminglass1/108/base 2025-12-04T08:54:02.2621333Z * [new branch] gh/benjaminglass1/108/head -> origin/gh/benjaminglass1/108/head 2025-12-04T08:54:02.2621417Z * [new branch] gh/benjaminglass1/108/orig -> origin/gh/benjaminglass1/108/orig 2025-12-04T08:54:02.2621502Z * [new branch] gh/benjaminglass1/109/base -> origin/gh/benjaminglass1/109/base 2025-12-04T08:54:02.2621623Z * [new branch] gh/benjaminglass1/109/head -> origin/gh/benjaminglass1/109/head 2025-12-04T08:54:02.2621706Z * [new branch] gh/benjaminglass1/109/orig -> origin/gh/benjaminglass1/109/orig 2025-12-04T08:54:02.2621792Z * [new branch] gh/benjaminglass1/97/base -> origin/gh/benjaminglass1/97/base 2025-12-04T08:54:02.2621879Z * [new branch] gh/benjaminglass1/97/head -> origin/gh/benjaminglass1/97/head 2025-12-04T08:54:02.2621964Z * [new branch] gh/benjaminglass1/97/orig -> origin/gh/benjaminglass1/97/orig 2025-12-04T08:54:02.2622043Z * [new branch] gh/bobrenjc93/570/base -> origin/gh/bobrenjc93/570/base 2025-12-04T08:54:02.2622116Z * [new branch] gh/bobrenjc93/570/head -> origin/gh/bobrenjc93/570/head 2025-12-04T08:54:02.2622190Z * [new branch] gh/bobrenjc93/570/orig -> origin/gh/bobrenjc93/570/orig 2025-12-04T08:54:02.2622262Z * [new branch] gh/bobrenjc93/604/base -> origin/gh/bobrenjc93/604/base 2025-12-04T08:54:02.2622335Z * [new branch] gh/bobrenjc93/604/head -> origin/gh/bobrenjc93/604/head 2025-12-04T08:54:02.2622408Z * [new branch] gh/bobrenjc93/604/orig -> origin/gh/bobrenjc93/604/orig 2025-12-04T08:54:02.2622481Z * [new branch] gh/bobrenjc93/638/base -> origin/gh/bobrenjc93/638/base 2025-12-04T08:54:02.2622579Z * [new branch] gh/bobrenjc93/638/head -> origin/gh/bobrenjc93/638/head 
2025-12-04T08:54:02.2622653Z * [new branch] gh/bobrenjc93/638/orig -> origin/gh/bobrenjc93/638/orig 2025-12-04T08:54:02.2622724Z * [new branch] gh/bobrenjc93/653/base -> origin/gh/bobrenjc93/653/base 2025-12-04T08:54:02.2622796Z * [new branch] gh/bobrenjc93/653/head -> origin/gh/bobrenjc93/653/head 2025-12-04T08:54:02.2622868Z * [new branch] gh/bobrenjc93/653/orig -> origin/gh/bobrenjc93/653/orig 2025-12-04T08:54:02.2622940Z * [new branch] gh/bobrenjc93/654/base -> origin/gh/bobrenjc93/654/base 2025-12-04T08:54:02.2623013Z * [new branch] gh/bobrenjc93/654/head -> origin/gh/bobrenjc93/654/head 2025-12-04T08:54:02.2623085Z * [new branch] gh/bobrenjc93/654/orig -> origin/gh/bobrenjc93/654/orig 2025-12-04T08:54:02.2623158Z * [new branch] gh/bobrenjc93/657/base -> origin/gh/bobrenjc93/657/base 2025-12-04T08:54:02.2623233Z * [new branch] gh/bobrenjc93/657/head -> origin/gh/bobrenjc93/657/head 2025-12-04T08:54:02.2623305Z * [new branch] gh/bobrenjc93/657/orig -> origin/gh/bobrenjc93/657/orig 2025-12-04T08:54:02.2623377Z * [new branch] gh/bobrenjc93/672/base -> origin/gh/bobrenjc93/672/base 2025-12-04T08:54:02.2623448Z * [new branch] gh/bobrenjc93/672/head -> origin/gh/bobrenjc93/672/head 2025-12-04T08:54:02.2623523Z * [new branch] gh/bobrenjc93/672/orig -> origin/gh/bobrenjc93/672/orig 2025-12-04T08:54:02.2623595Z * [new branch] gh/bobrenjc93/679/base -> origin/gh/bobrenjc93/679/base 2025-12-04T08:54:02.2623669Z * [new branch] gh/bobrenjc93/679/head -> origin/gh/bobrenjc93/679/head 2025-12-04T08:54:02.2623743Z * [new branch] gh/bobrenjc93/679/orig -> origin/gh/bobrenjc93/679/orig 2025-12-04T08:54:02.2623816Z * [new branch] gh/bobrenjc93/680/base -> origin/gh/bobrenjc93/680/base 2025-12-04T08:54:02.2623890Z * [new branch] gh/bobrenjc93/680/head -> origin/gh/bobrenjc93/680/head 2025-12-04T08:54:02.2623961Z * [new branch] gh/bobrenjc93/680/orig -> origin/gh/bobrenjc93/680/orig 2025-12-04T08:54:02.2624031Z * [new branch] gh/bobrenjc93/681/base -> origin/gh/bobrenjc93/681/base 
2025-12-04T08:54:02.2624104Z * [new branch] gh/bobrenjc93/681/head -> origin/gh/bobrenjc93/681/head 2025-12-04T08:54:02.2624175Z * [new branch] gh/bobrenjc93/681/orig -> origin/gh/bobrenjc93/681/orig 2025-12-04T08:54:02.2624279Z * [new branch] gh/bobrenjc93/682/base -> origin/gh/bobrenjc93/682/base 2025-12-04T08:54:02.2624354Z * [new branch] gh/bobrenjc93/682/head -> origin/gh/bobrenjc93/682/head 2025-12-04T08:54:02.2624428Z * [new branch] gh/bobrenjc93/682/orig -> origin/gh/bobrenjc93/682/orig 2025-12-04T08:54:02.2624506Z * [new branch] gh/bobrenjc93/683/base -> origin/gh/bobrenjc93/683/base 2025-12-04T08:54:02.2624579Z * [new branch] gh/bobrenjc93/683/head -> origin/gh/bobrenjc93/683/head 2025-12-04T08:54:02.2624651Z * [new branch] gh/bobrenjc93/683/orig -> origin/gh/bobrenjc93/683/orig 2025-12-04T08:54:02.2624723Z * [new branch] gh/bobrenjc93/684/base -> origin/gh/bobrenjc93/684/base 2025-12-04T08:54:02.2624795Z * [new branch] gh/bobrenjc93/684/head -> origin/gh/bobrenjc93/684/head 2025-12-04T08:54:02.2624867Z * [new branch] gh/bobrenjc93/684/orig -> origin/gh/bobrenjc93/684/orig 2025-12-04T08:54:02.2624939Z * [new branch] gh/bobrenjc93/685/base -> origin/gh/bobrenjc93/685/base 2025-12-04T08:54:02.2625011Z * [new branch] gh/bobrenjc93/685/head -> origin/gh/bobrenjc93/685/head 2025-12-04T08:54:02.2625085Z * [new branch] gh/bobrenjc93/685/orig -> origin/gh/bobrenjc93/685/orig 2025-12-04T08:54:02.2625186Z * [new branch] gh/bobrenjc93/686/base -> origin/gh/bobrenjc93/686/base 2025-12-04T08:54:02.2625260Z * [new branch] gh/bobrenjc93/686/head -> origin/gh/bobrenjc93/686/head 2025-12-04T08:54:02.2625331Z * [new branch] gh/bobrenjc93/686/orig -> origin/gh/bobrenjc93/686/orig 2025-12-04T08:54:02.2625405Z * [new branch] gh/bobrenjc93/687/base -> origin/gh/bobrenjc93/687/base 2025-12-04T08:54:02.2625476Z * [new branch] gh/bobrenjc93/687/head -> origin/gh/bobrenjc93/687/head 2025-12-04T08:54:02.2625546Z * [new branch] gh/bobrenjc93/687/orig -> origin/gh/bobrenjc93/687/orig 
2025-12-04T08:54:02.2625623Z * [new branch] gh/bobrenjc93/688/base -> origin/gh/bobrenjc93/688/base 2025-12-04T08:54:02.2625697Z * [new branch] gh/bobrenjc93/688/head -> origin/gh/bobrenjc93/688/head 2025-12-04T08:54:02.2625770Z * [new branch] gh/bobrenjc93/688/orig -> origin/gh/bobrenjc93/688/orig 2025-12-04T08:54:02.2625844Z * [new branch] gh/bobrenjc93/689/base -> origin/gh/bobrenjc93/689/base 2025-12-04T08:54:02.2625916Z * [new branch] gh/bobrenjc93/689/head -> origin/gh/bobrenjc93/689/head 2025-12-04T08:54:02.2626082Z * [new branch] gh/bobrenjc93/689/orig -> origin/gh/bobrenjc93/689/orig 2025-12-04T08:54:02.2626156Z * [new branch] gh/bobrenjc93/690/base -> origin/gh/bobrenjc93/690/base 2025-12-04T08:54:02.2626227Z * [new branch] gh/bobrenjc93/690/head -> origin/gh/bobrenjc93/690/head 2025-12-04T08:54:02.2626300Z * [new branch] gh/bobrenjc93/690/orig -> origin/gh/bobrenjc93/690/orig 2025-12-04T08:54:02.2626374Z * [new branch] gh/bobrenjc93/691/base -> origin/gh/bobrenjc93/691/base 2025-12-04T08:54:02.2626447Z * [new branch] gh/bobrenjc93/691/head -> origin/gh/bobrenjc93/691/head 2025-12-04T08:54:02.2626520Z * [new branch] gh/bobrenjc93/691/orig -> origin/gh/bobrenjc93/691/orig 2025-12-04T08:54:02.2626595Z * [new branch] gh/bobrenjc93/692/base -> origin/gh/bobrenjc93/692/base 2025-12-04T08:54:02.2626666Z * [new branch] gh/bobrenjc93/692/head -> origin/gh/bobrenjc93/692/head 2025-12-04T08:54:02.2626737Z * [new branch] gh/bobrenjc93/692/orig -> origin/gh/bobrenjc93/692/orig 2025-12-04T08:54:02.2626811Z * [new branch] gh/bobrenjc93/693/base -> origin/gh/bobrenjc93/693/base 2025-12-04T08:54:02.2626881Z * [new branch] gh/bobrenjc93/693/head -> origin/gh/bobrenjc93/693/head 2025-12-04T08:54:02.2626996Z * [new branch] gh/bobrenjc93/693/orig -> origin/gh/bobrenjc93/693/orig 2025-12-04T08:54:02.2627069Z * [new branch] gh/bobrenjc93/694/base -> origin/gh/bobrenjc93/694/base 2025-12-04T08:54:02.2627143Z * [new branch] gh/bobrenjc93/694/head -> origin/gh/bobrenjc93/694/head 
2025-12-04T08:54:02.2627220Z * [new branch] gh/bobrenjc93/694/orig -> origin/gh/bobrenjc93/694/orig 2025-12-04T08:54:02.2627292Z * [new branch] gh/bobrenjc93/695/base -> origin/gh/bobrenjc93/695/base 2025-12-04T08:54:02.2627363Z * [new branch] gh/bobrenjc93/695/head -> origin/gh/bobrenjc93/695/head 2025-12-04T08:54:02.2627436Z * [new branch] gh/bobrenjc93/695/orig -> origin/gh/bobrenjc93/695/orig 2025-12-04T08:54:02.2627503Z * [new branch] gh/c00w/23/base -> origin/gh/c00w/23/base 2025-12-04T08:54:02.2627569Z * [new branch] gh/c00w/23/head -> origin/gh/c00w/23/head 2025-12-04T08:54:02.2627638Z * [new branch] gh/c00w/53/base -> origin/gh/c00w/53/base 2025-12-04T08:54:02.2627703Z * [new branch] gh/c00w/53/head -> origin/gh/c00w/53/head 2025-12-04T08:54:02.2627767Z * [new branch] gh/c00w/53/orig -> origin/gh/c00w/53/orig 2025-12-04T08:54:02.2627887Z * [new branch] gh/c00w/54/base -> origin/gh/c00w/54/base 2025-12-04T08:54:02.2627950Z * [new branch] gh/c00w/54/head -> origin/gh/c00w/54/head 2025-12-04T08:54:02.2628011Z * [new branch] gh/c00w/54/orig -> origin/gh/c00w/54/orig 2025-12-04T08:54:02.2628075Z * [new branch] gh/c00w/56/base -> origin/gh/c00w/56/base 2025-12-04T08:54:02.2628136Z * [new branch] gh/c00w/56/head -> origin/gh/c00w/56/head 2025-12-04T08:54:02.2628197Z * [new branch] gh/c00w/56/orig -> origin/gh/c00w/56/orig 2025-12-04T08:54:02.2628267Z * [new branch] gh/c00w/57/base -> origin/gh/c00w/57/base 2025-12-04T08:54:02.2628327Z * [new branch] gh/c00w/57/head -> origin/gh/c00w/57/head 2025-12-04T08:54:02.2628388Z * [new branch] gh/c00w/57/orig -> origin/gh/c00w/57/orig 2025-12-04T08:54:02.2628451Z * [new branch] gh/c00w/58/base -> origin/gh/c00w/58/base 2025-12-04T08:54:02.2628515Z * [new branch] gh/c00w/58/head -> origin/gh/c00w/58/head 2025-12-04T08:54:02.2628576Z * [new branch] gh/c00w/58/orig -> origin/gh/c00w/58/orig 2025-12-04T08:54:02.2628650Z * [new branch] gh/clee2000/1/base -> origin/gh/clee2000/1/base 2025-12-04T08:54:02.2628721Z * [new branch] 
gh/clee2000/1/head -> origin/gh/clee2000/1/head 2025-12-04T08:54:02.2628795Z * [new branch] gh/clee2000/1/orig -> origin/gh/clee2000/1/orig 2025-12-04T08:54:02.2628874Z * [new branch] gh/coconutruben/1/base -> origin/gh/coconutruben/1/base 2025-12-04T08:54:02.2628949Z * [new branch] gh/coconutruben/1/head -> origin/gh/coconutruben/1/head 2025-12-04T08:54:02.2629029Z * [new branch] gh/coconutruben/55/base -> origin/gh/coconutruben/55/base 2025-12-04T08:54:02.2629107Z * [new branch] gh/coconutruben/55/head -> origin/gh/coconutruben/55/head 2025-12-04T08:54:02.2629183Z * [new branch] gh/coconutruben/55/orig -> origin/gh/coconutruben/55/orig 2025-12-04T08:54:02.2629262Z * [new branch] gh/coconutruben/57/base -> origin/gh/coconutruben/57/base 2025-12-04T08:54:02.2629339Z * [new branch] gh/coconutruben/57/head -> origin/gh/coconutruben/57/head 2025-12-04T08:54:02.2629414Z * [new branch] gh/coconutruben/57/orig -> origin/gh/coconutruben/57/orig 2025-12-04T08:54:02.2629489Z * [new branch] gh/coconutruben/70/base -> origin/gh/coconutruben/70/base 2025-12-04T08:54:02.2629598Z * [new branch] gh/coconutruben/70/head -> origin/gh/coconutruben/70/head 2025-12-04T08:54:02.2629672Z * [new branch] gh/coconutruben/70/orig -> origin/gh/coconutruben/70/orig 2025-12-04T08:54:02.2629749Z * [new branch] gh/coconutruben/71/base -> origin/gh/coconutruben/71/base 2025-12-04T08:54:02.2629825Z * [new branch] gh/coconutruben/71/head -> origin/gh/coconutruben/71/head 2025-12-04T08:54:02.2629902Z * [new branch] gh/coconutruben/71/orig -> origin/gh/coconutruben/71/orig 2025-12-04T08:54:02.2629982Z * [new branch] gh/coconutruben/72/base -> origin/gh/coconutruben/72/base 2025-12-04T08:54:02.2630056Z * [new branch] gh/coconutruben/72/head -> origin/gh/coconutruben/72/head 2025-12-04T08:54:02.2630130Z * [new branch] gh/coconutruben/72/orig -> origin/gh/coconutruben/72/orig 2025-12-04T08:54:02.2630207Z * [new branch] gh/coconutruben/73/base -> origin/gh/coconutruben/73/base 
2025-12-04T08:54:02.2630282Z * [new branch] gh/coconutruben/73/head -> origin/gh/coconutruben/73/head 2025-12-04T08:54:02.2630358Z * [new branch] gh/coconutruben/73/orig -> origin/gh/coconutruben/73/orig 2025-12-04T08:54:02.2630434Z * [new branch] gh/coconutruben/74/base -> origin/gh/coconutruben/74/base 2025-12-04T08:54:02.2630540Z * [new branch] gh/coconutruben/74/head -> origin/gh/coconutruben/74/head 2025-12-04T08:54:02.2630618Z * [new branch] gh/coconutruben/74/orig -> origin/gh/coconutruben/74/orig 2025-12-04T08:54:02.2630692Z * [new branch] gh/coconutruben/79/base -> origin/gh/coconutruben/79/base 2025-12-04T08:54:02.2630766Z * [new branch] gh/coconutruben/79/head -> origin/gh/coconutruben/79/head 2025-12-04T08:54:02.2630841Z * [new branch] gh/coconutruben/79/orig -> origin/gh/coconutruben/79/orig 2025-12-04T08:54:02.2630916Z * [new branch] gh/coconutruben/80/base -> origin/gh/coconutruben/80/base 2025-12-04T08:54:02.2630990Z * [new branch] gh/coconutruben/80/head -> origin/gh/coconutruben/80/head 2025-12-04T08:54:02.2631066Z * [new branch] gh/coconutruben/80/orig -> origin/gh/coconutruben/80/orig 2025-12-04T08:54:02.2631142Z * [new branch] gh/coconutruben/82/base -> origin/gh/coconutruben/82/base 2025-12-04T08:54:02.2631220Z * [new branch] gh/coconutruben/82/head -> origin/gh/coconutruben/82/head 2025-12-04T08:54:02.2631294Z * [new branch] gh/coconutruben/82/orig -> origin/gh/coconutruben/82/orig 2025-12-04T08:54:02.2631367Z * [new branch] gh/coconutruben/83/base -> origin/gh/coconutruben/83/base 2025-12-04T08:54:02.2631440Z * [new branch] gh/coconutruben/83/head -> origin/gh/coconutruben/83/head 2025-12-04T08:54:02.2631515Z * [new branch] gh/coconutruben/83/orig -> origin/gh/coconutruben/83/orig 2025-12-04T08:54:02.2631593Z * [new branch] gh/coconutruben/84/base -> origin/gh/coconutruben/84/base 2025-12-04T08:54:02.2631669Z * [new branch] gh/coconutruben/84/head -> origin/gh/coconutruben/84/head 2025-12-04T08:54:02.2631745Z * [new branch] 
gh/coconutruben/84/orig -> origin/gh/coconutruben/84/orig 2025-12-04T08:54:02.2631823Z * [new branch] gh/coconutruben/85/base -> origin/gh/coconutruben/85/base 2025-12-04T08:54:02.2631899Z * [new branch] gh/coconutruben/85/head -> origin/gh/coconutruben/85/head 2025-12-04T08:54:02.2631973Z * [new branch] gh/coconutruben/85/orig -> origin/gh/coconutruben/85/orig 2025-12-04T08:54:02.2632046Z * [new branch] gh/coconutruben/86/base -> origin/gh/coconutruben/86/base 2025-12-04T08:54:02.2632129Z * [new branch] gh/coconutruben/86/head -> origin/gh/coconutruben/86/head 2025-12-04T08:54:02.2632203Z * [new branch] gh/coconutruben/86/orig -> origin/gh/coconutruben/86/orig 2025-12-04T08:54:02.2632318Z * [new branch] gh/colinchan15/1/base -> origin/gh/colinchan15/1/base 2025-12-04T08:54:02.2632395Z * [new branch] gh/colinchan15/1/head -> origin/gh/colinchan15/1/head 2025-12-04T08:54:02.2632468Z * [new branch] gh/colinchan15/2/base -> origin/gh/colinchan15/2/base 2025-12-04T08:54:02.2632543Z * [new branch] gh/colinchan15/2/head -> origin/gh/colinchan15/2/head 2025-12-04T08:54:02.2632617Z * [new branch] gh/colinchan15/3/base -> origin/gh/colinchan15/3/base 2025-12-04T08:54:02.2632691Z * [new branch] gh/colinchan15/3/head -> origin/gh/colinchan15/3/head 2025-12-04T08:54:02.2632764Z * [new branch] gh/colinchan15/6/base -> origin/gh/colinchan15/6/base 2025-12-04T08:54:02.2632838Z * [new branch] gh/colinchan15/6/head -> origin/gh/colinchan15/6/head 2025-12-04T08:54:02.2632904Z * [new branch] gh/d4l3k/1/base -> origin/gh/d4l3k/1/base 2025-12-04T08:54:02.2632969Z * [new branch] gh/d4l3k/1/head -> origin/gh/d4l3k/1/head 2025-12-04T08:54:02.2633037Z * [new branch] gh/d4l3k/2/base -> origin/gh/d4l3k/2/base 2025-12-04T08:54:02.2633102Z * [new branch] gh/d4l3k/2/head -> origin/gh/d4l3k/2/head 2025-12-04T08:54:02.2633200Z * [new branch] gh/d4l3k/2/orig -> origin/gh/d4l3k/2/orig 2025-12-04T08:54:02.2633265Z * [new branch] gh/d4l3k/3/base -> origin/gh/d4l3k/3/base 2025-12-04T08:54:02.2633327Z 
* [new branch] gh/d4l3k/3/head -> origin/gh/d4l3k/3/head 2025-12-04T08:54:02.2633389Z * [new branch] gh/d4l3k/3/orig -> origin/gh/d4l3k/3/orig 2025-12-04T08:54:02.2633453Z * [new branch] gh/d4l3k/4/base -> origin/gh/d4l3k/4/base 2025-12-04T08:54:02.2633516Z * [new branch] gh/d4l3k/4/head -> origin/gh/d4l3k/4/head 2025-12-04T08:54:02.2633579Z * [new branch] gh/d4l3k/4/orig -> origin/gh/d4l3k/4/orig 2025-12-04T08:54:02.2633645Z * [new branch] gh/d4l3k/5/base -> origin/gh/d4l3k/5/base 2025-12-04T08:54:02.2633709Z * [new branch] gh/d4l3k/5/orig -> origin/gh/d4l3k/5/orig 2025-12-04T08:54:02.2633800Z * [new branch] gh/davidberard98/392/base -> origin/gh/davidberard98/392/base 2025-12-04T08:54:02.2633885Z * [new branch] gh/davidberard98/392/head -> origin/gh/davidberard98/392/head 2025-12-04T08:54:02.2633966Z * [new branch] gh/davidberard98/392/orig -> origin/gh/davidberard98/392/orig 2025-12-04T08:54:02.2634049Z * [new branch] gh/davidberard98/399/base -> origin/gh/davidberard98/399/base 2025-12-04T08:54:02.2634130Z * [new branch] gh/davidberard98/399/head -> origin/gh/davidberard98/399/head 2025-12-04T08:54:02.2634209Z * [new branch] gh/davidberard98/399/orig -> origin/gh/davidberard98/399/orig 2025-12-04T08:54:02.2634290Z * [new branch] gh/desertfire/605/base -> origin/gh/desertfire/605/base 2025-12-04T08:54:02.2634367Z * [new branch] gh/desertfire/605/head -> origin/gh/desertfire/605/head 2025-12-04T08:54:02.2634444Z * [new branch] gh/desertfire/605/orig -> origin/gh/desertfire/605/orig 2025-12-04T08:54:02.2634523Z * [new branch] gh/desertfire/606/base -> origin/gh/desertfire/606/base 2025-12-04T08:54:02.2634596Z * [new branch] gh/desertfire/606/head -> origin/gh/desertfire/606/head 2025-12-04T08:54:02.2634667Z * [new branch] gh/desertfire/606/orig -> origin/gh/desertfire/606/orig 2025-12-04T08:54:02.2634741Z * [new branch] gh/desertfire/607/base -> origin/gh/desertfire/607/base 2025-12-04T08:54:02.2634813Z * [new branch] gh/desertfire/607/head -> 
origin/gh/desertfire/607/head 2025-12-04T08:54:02.2634915Z * [new branch] gh/desertfire/607/orig -> origin/gh/desertfire/607/orig 2025-12-04T08:54:02.2634989Z * [new branch] gh/desertfire/608/base -> origin/gh/desertfire/608/base 2025-12-04T08:54:02.2635060Z * [new branch] gh/desertfire/608/head -> origin/gh/desertfire/608/head 2025-12-04T08:54:02.2635132Z * [new branch] gh/desertfire/608/orig -> origin/gh/desertfire/608/orig 2025-12-04T08:54:02.2635208Z * [new branch] gh/desertfire/609/base -> origin/gh/desertfire/609/base 2025-12-04T08:54:02.2635280Z * [new branch] gh/desertfire/609/head -> origin/gh/desertfire/609/head 2025-12-04T08:54:02.2635354Z * [new branch] gh/desertfire/609/orig -> origin/gh/desertfire/609/orig 2025-12-04T08:54:02.2635427Z * [new branch] gh/desertfire/610/base -> origin/gh/desertfire/610/base 2025-12-04T08:54:02.2635499Z * [new branch] gh/desertfire/610/head -> origin/gh/desertfire/610/head 2025-12-04T08:54:02.2635574Z * [new branch] gh/desertfire/610/orig -> origin/gh/desertfire/610/orig 2025-12-04T08:54:02.2635646Z * [new branch] gh/desertfire/611/base -> origin/gh/desertfire/611/base 2025-12-04T08:54:02.2635719Z * [new branch] gh/desertfire/611/head -> origin/gh/desertfire/611/head 2025-12-04T08:54:02.2635793Z * [new branch] gh/desertfire/611/orig -> origin/gh/desertfire/611/orig 2025-12-04T08:54:02.2635896Z * [new branch] gh/desertfire/612/base -> origin/gh/desertfire/612/base 2025-12-04T08:54:02.2636007Z * [new branch] gh/desertfire/612/head -> origin/gh/desertfire/612/head 2025-12-04T08:54:02.2636083Z * [new branch] gh/desertfire/612/orig -> origin/gh/desertfire/612/orig 2025-12-04T08:54:02.2636156Z * [new branch] gh/desertfire/613/base -> origin/gh/desertfire/613/base 2025-12-04T08:54:02.2636228Z * [new branch] gh/desertfire/613/head -> origin/gh/desertfire/613/head 2025-12-04T08:54:02.2636305Z * [new branch] gh/desertfire/613/orig -> origin/gh/desertfire/613/orig 2025-12-04T08:54:02.2636379Z * [new branch] gh/desertfire/614/base -> 
origin/gh/desertfire/614/base 2025-12-04T08:54:02.2636454Z * [new branch] gh/desertfire/614/head -> origin/gh/desertfire/614/head 2025-12-04T08:54:02.2636533Z * [new branch] gh/desertfire/614/orig -> origin/gh/desertfire/614/orig 2025-12-04T08:54:02.2636606Z * [new branch] gh/desertfire/615/base -> origin/gh/desertfire/615/base 2025-12-04T08:54:02.2636680Z * [new branch] gh/desertfire/615/head -> origin/gh/desertfire/615/head 2025-12-04T08:54:02.2636756Z * [new branch] gh/desertfire/615/orig -> origin/gh/desertfire/615/orig 2025-12-04T08:54:02.2636829Z * [new branch] gh/desertfire/616/base -> origin/gh/desertfire/616/base 2025-12-04T08:54:02.2636906Z * [new branch] gh/desertfire/616/head -> origin/gh/desertfire/616/head 2025-12-04T08:54:02.2636979Z * [new branch] gh/desertfire/616/orig -> origin/gh/desertfire/616/orig 2025-12-04T08:54:02.2637051Z * [new branch] gh/desertfire/617/base -> origin/gh/desertfire/617/base 2025-12-04T08:54:02.2637124Z * [new branch] gh/desertfire/617/head -> origin/gh/desertfire/617/head 2025-12-04T08:54:02.2637197Z * [new branch] gh/desertfire/617/orig -> origin/gh/desertfire/617/orig 2025-12-04T08:54:02.2637268Z * [new branch] gh/dharakk/1/base -> origin/gh/dharakk/1/base 2025-12-04T08:54:02.2637344Z * [new branch] gh/dharakk/1/head -> origin/gh/dharakk/1/head 2025-12-04T08:54:02.2637417Z * [new branch] gh/drisspg/170/base -> origin/gh/drisspg/170/base 2025-12-04T08:54:02.2637488Z * [new branch] gh/drisspg/170/head -> origin/gh/drisspg/170/head 2025-12-04T08:54:02.2637561Z * [new branch] gh/drisspg/170/orig -> origin/gh/drisspg/170/orig 2025-12-04T08:54:02.2637681Z * [new branch] gh/drisspg/182/base -> origin/gh/drisspg/182/base 2025-12-04T08:54:02.2637749Z * [new branch] gh/drisspg/182/head -> origin/gh/drisspg/182/head 2025-12-04T08:54:02.2637819Z * [new branch] gh/drisspg/183/base -> origin/gh/drisspg/183/base 2025-12-04T08:54:02.2637890Z * [new branch] gh/drisspg/183/head -> origin/gh/drisspg/183/head 2025-12-04T08:54:02.2637958Z * 
[new branch] gh/drisspg/184/base -> origin/gh/drisspg/184/base 2025-12-04T08:54:02.2638032Z * [new branch] gh/drisspg/184/head -> origin/gh/drisspg/184/head 2025-12-04T08:54:02.2638100Z * [new branch] gh/drisspg/185/base -> origin/gh/drisspg/185/base 2025-12-04T08:54:02.2638168Z * [new branch] gh/drisspg/185/head -> origin/gh/drisspg/185/head 2025-12-04T08:54:02.2638238Z * [new branch] gh/drisspg/194/base -> origin/gh/drisspg/194/base 2025-12-04T08:54:02.2638307Z * [new branch] gh/drisspg/194/head -> origin/gh/drisspg/194/head 2025-12-04T08:54:02.2638375Z * [new branch] gh/drisspg/194/orig -> origin/gh/drisspg/194/orig 2025-12-04T08:54:02.2638445Z * [new branch] gh/drisspg/200/base -> origin/gh/drisspg/200/base 2025-12-04T08:54:02.2638568Z * [new branch] gh/drisspg/200/head -> origin/gh/drisspg/200/head 2025-12-04T08:54:02.2638645Z * [new branch] gh/drisspg/200/orig -> origin/gh/drisspg/200/orig 2025-12-04T08:54:02.2638714Z * [new branch] gh/drisspg/218/base -> origin/gh/drisspg/218/base 2025-12-04T08:54:02.2638781Z * [new branch] gh/drisspg/218/head -> origin/gh/drisspg/218/head 2025-12-04T08:54:02.2638852Z * [new branch] gh/drisspg/218/orig -> origin/gh/drisspg/218/orig 2025-12-04T08:54:02.2638923Z * [new branch] gh/drisspg/219/base -> origin/gh/drisspg/219/base 2025-12-04T08:54:02.2638994Z * [new branch] gh/drisspg/219/head -> origin/gh/drisspg/219/head 2025-12-04T08:54:02.2639064Z * [new branch] gh/drisspg/219/orig -> origin/gh/drisspg/219/orig 2025-12-04T08:54:02.2639133Z * [new branch] gh/drisspg/220/base -> origin/gh/drisspg/220/base 2025-12-04T08:54:02.2639202Z * [new branch] gh/drisspg/220/head -> origin/gh/drisspg/220/head 2025-12-04T08:54:02.2639272Z * [new branch] gh/drisspg/220/orig -> origin/gh/drisspg/220/orig 2025-12-04T08:54:02.2639343Z * [new branch] gh/drisspg/221/base -> origin/gh/drisspg/221/base 2025-12-04T08:54:02.2639414Z * [new branch] gh/drisspg/221/head -> origin/gh/drisspg/221/head 2025-12-04T08:54:02.2639485Z * [new branch] 
gh/drisspg/221/orig -> origin/gh/drisspg/221/orig 2025-12-04T08:54:02.2639553Z * [new branch] gh/drisspg/222/base -> origin/gh/drisspg/222/base 2025-12-04T08:54:02.2639622Z * [new branch] gh/drisspg/222/head -> origin/gh/drisspg/222/head 2025-12-04T08:54:02.2639692Z * [new branch] gh/drisspg/222/orig -> origin/gh/drisspg/222/orig 2025-12-04T08:54:02.2639759Z * [new branch] gh/drisspg/223/base -> origin/gh/drisspg/223/base 2025-12-04T08:54:02.2639831Z * [new branch] gh/drisspg/223/head -> origin/gh/drisspg/223/head 2025-12-04T08:54:02.2639905Z * [new branch] gh/drisspg/223/orig -> origin/gh/drisspg/223/orig 2025-12-04T08:54:02.2639973Z * [new branch] gh/drisspg/224/base -> origin/gh/drisspg/224/base 2025-12-04T08:54:02.2640040Z * [new branch] gh/drisspg/224/head -> origin/gh/drisspg/224/head 2025-12-04T08:54:02.2640110Z * [new branch] gh/drisspg/224/orig -> origin/gh/drisspg/224/orig 2025-12-04T08:54:02.2640179Z * [new branch] gh/drisspg/225/base -> origin/gh/drisspg/225/base 2025-12-04T08:54:02.2640282Z * [new branch] gh/drisspg/225/head -> origin/gh/drisspg/225/head 2025-12-04T08:54:02.2640352Z * [new branch] gh/drisspg/225/orig -> origin/gh/drisspg/225/orig 2025-12-04T08:54:02.2640422Z * [new branch] gh/drisspg/226/base -> origin/gh/drisspg/226/base 2025-12-04T08:54:02.2640492Z * [new branch] gh/drisspg/226/head -> origin/gh/drisspg/226/head 2025-12-04T08:54:02.2640560Z * [new branch] gh/drisspg/226/orig -> origin/gh/drisspg/226/orig 2025-12-04T08:54:02.2640628Z * [new branch] gh/drisspg/227/base -> origin/gh/drisspg/227/base 2025-12-04T08:54:02.2640698Z * [new branch] gh/drisspg/227/head -> origin/gh/drisspg/227/head 2025-12-04T08:54:02.2640768Z * [new branch] gh/drisspg/227/orig -> origin/gh/drisspg/227/orig 2025-12-04T08:54:02.2640838Z * [new branch] gh/drisspg/228/base -> origin/gh/drisspg/228/base 2025-12-04T08:54:02.2640911Z * [new branch] gh/drisspg/228/head -> origin/gh/drisspg/228/head 2025-12-04T08:54:02.2640979Z * [new branch] gh/drisspg/228/orig -> 
origin/gh/drisspg/228/orig 2025-12-04T08:54:02.2641047Z * [new branch] gh/drisspg/229/base -> origin/gh/drisspg/229/base 2025-12-04T08:54:02.2641162Z * [new branch] gh/drisspg/229/head -> origin/gh/drisspg/229/head 2025-12-04T08:54:02.2641231Z * [new branch] gh/drisspg/229/orig -> origin/gh/drisspg/229/orig 2025-12-04T08:54:02.2641299Z * [new branch] gh/drisspg/230/base -> origin/gh/drisspg/230/base 2025-12-04T08:54:02.2641372Z * [new branch] gh/drisspg/230/head -> origin/gh/drisspg/230/head 2025-12-04T08:54:02.2641441Z * [new branch] gh/drisspg/230/orig -> origin/gh/drisspg/230/orig 2025-12-04T08:54:02.2641515Z * [new branch] gh/dsjohns2/1/base -> origin/gh/dsjohns2/1/base 2025-12-04T08:54:02.2641588Z * [new branch] gh/dsjohns2/1/head -> origin/gh/dsjohns2/1/head 2025-12-04T08:54:02.2641664Z * [new branch] gh/dzmitry-huba/1/base -> origin/gh/dzmitry-huba/1/base 2025-12-04T08:54:02.2641739Z * [new branch] gh/dzmitry-huba/1/head -> origin/gh/dzmitry-huba/1/head 2025-12-04T08:54:02.2641819Z * [new branch] gh/dzmitry-huba/12/base -> origin/gh/dzmitry-huba/12/base 2025-12-04T08:54:02.2641894Z * [new branch] gh/dzmitry-huba/12/head -> origin/gh/dzmitry-huba/12/head 2025-12-04T08:54:02.2641973Z * [new branch] gh/dzmitry-huba/12/orig -> origin/gh/dzmitry-huba/12/orig 2025-12-04T08:54:02.2642048Z * [new branch] gh/dzmitry-huba/13/base -> origin/gh/dzmitry-huba/13/base 2025-12-04T08:54:02.2642122Z * [new branch] gh/dzmitry-huba/13/head -> origin/gh/dzmitry-huba/13/head 2025-12-04T08:54:02.2642200Z * [new branch] gh/dzmitry-huba/13/orig -> origin/gh/dzmitry-huba/13/orig 2025-12-04T08:54:02.2642273Z * [new branch] gh/dzmitry-huba/14/base -> origin/gh/dzmitry-huba/14/base 2025-12-04T08:54:02.2642347Z * [new branch] gh/dzmitry-huba/14/head -> origin/gh/dzmitry-huba/14/head 2025-12-04T08:54:02.2642423Z * [new branch] gh/dzmitry-huba/14/orig -> origin/gh/dzmitry-huba/14/orig 2025-12-04T08:54:02.2642499Z * [new branch] gh/dzmitry-huba/15/base -> origin/gh/dzmitry-huba/15/base 
2025-12-04T08:54:02.2642575Z * [new branch] gh/dzmitry-huba/15/head -> origin/gh/dzmitry-huba/15/head 2025-12-04T08:54:02.2642652Z * [new branch] gh/dzmitry-huba/15/orig -> origin/gh/dzmitry-huba/15/orig 2025-12-04T08:54:02.2642726Z * [new branch] gh/dzmitry-huba/16/base -> origin/gh/dzmitry-huba/16/base 2025-12-04T08:54:02.2642799Z * [new branch] gh/dzmitry-huba/16/head -> origin/gh/dzmitry-huba/16/head 2025-12-04T08:54:02.2642905Z * [new branch] gh/dzmitry-huba/16/orig -> origin/gh/dzmitry-huba/16/orig 2025-12-04T08:54:02.2642979Z * [new branch] gh/dzmitry-huba/17/base -> origin/gh/dzmitry-huba/17/base 2025-12-04T08:54:02.2643052Z * [new branch] gh/dzmitry-huba/17/head -> origin/gh/dzmitry-huba/17/head 2025-12-04T08:54:02.2643132Z * [new branch] gh/dzmitry-huba/17/orig -> origin/gh/dzmitry-huba/17/orig 2025-12-04T08:54:02.2643207Z * [new branch] gh/dzmitry-huba/2/base -> origin/gh/dzmitry-huba/2/base 2025-12-04T08:54:02.2643281Z * [new branch] gh/dzmitry-huba/2/head -> origin/gh/dzmitry-huba/2/head 2025-12-04T08:54:02.2643356Z * [new branch] gh/dzmitry-huba/3/base -> origin/gh/dzmitry-huba/3/base 2025-12-04T08:54:02.2643430Z * [new branch] gh/dzmitry-huba/3/head -> origin/gh/dzmitry-huba/3/head 2025-12-04T08:54:02.2643504Z * [new branch] gh/eellison/808/base -> origin/gh/eellison/808/base 2025-12-04T08:54:02.2643580Z * [new branch] gh/eellison/808/head -> origin/gh/eellison/808/head 2025-12-04T08:54:02.2643652Z * [new branch] gh/eellison/808/orig -> origin/gh/eellison/808/orig 2025-12-04T08:54:02.2643724Z * [new branch] gh/eellison/822/base -> origin/gh/eellison/822/base 2025-12-04T08:54:02.2643823Z * [new branch] gh/eellison/822/head -> origin/gh/eellison/822/head 2025-12-04T08:54:02.2643894Z * [new branch] gh/eellison/822/orig -> origin/gh/eellison/822/orig 2025-12-04T08:54:02.2643966Z * [new branch] gh/eellison/823/base -> origin/gh/eellison/823/base 2025-12-04T08:54:02.2644035Z * [new branch] gh/eellison/823/head -> origin/gh/eellison/823/head 
2025-12-04T08:54:02.2644104Z * [new branch] gh/eellison/823/orig -> origin/gh/eellison/823/orig 2025-12-04T08:54:02.2644175Z * [new branch] gh/eellison/862/base -> origin/gh/eellison/862/base 2025-12-04T08:54:02.2644248Z * [new branch] gh/eellison/862/head -> origin/gh/eellison/862/head 2025-12-04T08:54:02.2644319Z * [new branch] gh/eellison/862/orig -> origin/gh/eellison/862/orig 2025-12-04T08:54:02.2644394Z * [new branch] gh/eellison/863/base -> origin/gh/eellison/863/base 2025-12-04T08:54:02.2644464Z * [new branch] gh/eellison/863/head -> origin/gh/eellison/863/head 2025-12-04T08:54:02.2644534Z * [new branch] gh/eellison/863/orig -> origin/gh/eellison/863/orig 2025-12-04T08:54:02.2644607Z * [new branch] gh/eellison/864/base -> origin/gh/eellison/864/base 2025-12-04T08:54:02.2644678Z * [new branch] gh/eellison/864/head -> origin/gh/eellison/864/head 2025-12-04T08:54:02.2644751Z * [new branch] gh/eellison/864/orig -> origin/gh/eellison/864/orig 2025-12-04T08:54:02.2644826Z * [new branch] gh/eellison/865/base -> origin/gh/eellison/865/base 2025-12-04T08:54:02.2644895Z * [new branch] gh/eellison/865/head -> origin/gh/eellison/865/head 2025-12-04T08:54:02.2644964Z * [new branch] gh/eellison/865/orig -> origin/gh/eellison/865/orig 2025-12-04T08:54:02.2645036Z * [new branch] gh/eellison/866/base -> origin/gh/eellison/866/base 2025-12-04T08:54:02.2645108Z * [new branch] gh/eellison/866/head -> origin/gh/eellison/866/head 2025-12-04T08:54:02.2645184Z * [new branch] gh/eellison/866/orig -> origin/gh/eellison/866/orig 2025-12-04T08:54:02.2645255Z * [new branch] gh/eellison/867/base -> origin/gh/eellison/867/base 2025-12-04T08:54:02.2645326Z * [new branch] gh/eellison/867/head -> origin/gh/eellison/867/head 2025-12-04T08:54:02.2645397Z * [new branch] gh/eellison/867/orig -> origin/gh/eellison/867/orig 2025-12-04T08:54:02.2645492Z * [new branch] gh/eellison/868/base -> origin/gh/eellison/868/base 2025-12-04T08:54:02.2645564Z * [new branch] gh/eellison/868/head -> 
origin/gh/eellison/868/head 2025-12-04T08:54:02.2645638Z * [new branch] gh/eellison/868/orig -> origin/gh/eellison/868/orig 2025-12-04T08:54:02.2645708Z * [new branch] gh/eellison/869/base -> origin/gh/eellison/869/base 2025-12-04T08:54:02.2645778Z * [new branch] gh/eellison/869/head -> origin/gh/eellison/869/head 2025-12-04T08:54:02.2645849Z * [new branch] gh/eellison/869/orig -> origin/gh/eellison/869/orig 2025-12-04T08:54:02.2645963Z * [new branch] gh/eellison/870/base -> origin/gh/eellison/870/base 2025-12-04T08:54:02.2646038Z * [new branch] gh/eellison/870/head -> origin/gh/eellison/870/head 2025-12-04T08:54:02.2646113Z * [new branch] gh/eellison/870/orig -> origin/gh/eellison/870/orig 2025-12-04T08:54:02.2646184Z * [new branch] gh/eellison/871/base -> origin/gh/eellison/871/base 2025-12-04T08:54:02.2646254Z * [new branch] gh/eellison/871/head -> origin/gh/eellison/871/head 2025-12-04T08:54:02.2646324Z * [new branch] gh/eellison/871/orig -> origin/gh/eellison/871/orig 2025-12-04T08:54:02.2646394Z * [new branch] gh/eellison/872/base -> origin/gh/eellison/872/base 2025-12-04T08:54:02.2646508Z * [new branch] gh/eellison/872/head -> origin/gh/eellison/872/head 2025-12-04T08:54:02.2646581Z * [new branch] gh/eellison/872/orig -> origin/gh/eellison/872/orig 2025-12-04T08:54:02.2646653Z * [new branch] gh/eellison/873/base -> origin/gh/eellison/873/base 2025-12-04T08:54:02.2646725Z * [new branch] gh/eellison/873/head -> origin/gh/eellison/873/head 2025-12-04T08:54:02.2646794Z * [new branch] gh/eellison/873/orig -> origin/gh/eellison/873/orig 2025-12-04T08:54:02.2646864Z * [new branch] gh/eellison/874/base -> origin/gh/eellison/874/base 2025-12-04T08:54:02.2646938Z * [new branch] gh/eellison/874/head -> origin/gh/eellison/874/head 2025-12-04T08:54:02.2647011Z * [new branch] gh/eellison/874/orig -> origin/gh/eellison/874/orig 2025-12-04T08:54:02.2647081Z * [new branch] gh/eellison/875/base -> origin/gh/eellison/875/base 2025-12-04T08:54:02.2647152Z * [new branch] 
gh/eellison/875/head -> origin/gh/eellison/875/head 2025-12-04T08:54:02.2647222Z * [new branch] gh/eellison/875/orig -> origin/gh/eellison/875/orig 2025-12-04T08:54:02.2647291Z * [new branch] gh/eellison/876/base -> origin/gh/eellison/876/base 2025-12-04T08:54:02.2647364Z * [new branch] gh/eellison/876/head -> origin/gh/eellison/876/head 2025-12-04T08:54:02.2647435Z * [new branch] gh/eellison/876/orig -> origin/gh/eellison/876/orig 2025-12-04T08:54:02.2647506Z * [new branch] gh/eellison/877/base -> origin/gh/eellison/877/base 2025-12-04T08:54:02.2647578Z * [new branch] gh/eellison/877/head -> origin/gh/eellison/877/head 2025-12-04T08:54:02.2647647Z * [new branch] gh/eellison/877/orig -> origin/gh/eellison/877/orig 2025-12-04T08:54:02.2647719Z * [new branch] gh/eellison/878/base -> origin/gh/eellison/878/base 2025-12-04T08:54:02.2647790Z * [new branch] gh/eellison/878/head -> origin/gh/eellison/878/head 2025-12-04T08:54:02.2647859Z * [new branch] gh/eellison/878/orig -> origin/gh/eellison/878/orig 2025-12-04T08:54:02.2647930Z * [new branch] gh/eellison/879/base -> origin/gh/eellison/879/base 2025-12-04T08:54:02.2648002Z * [new branch] gh/eellison/879/head -> origin/gh/eellison/879/head 2025-12-04T08:54:02.2648074Z * [new branch] gh/eellison/879/orig -> origin/gh/eellison/879/orig 2025-12-04T08:54:02.2648192Z * [new branch] gh/eellison/880/base -> origin/gh/eellison/880/base 2025-12-04T08:54:02.2648263Z * [new branch] gh/eellison/880/head -> origin/gh/eellison/880/head 2025-12-04T08:54:02.2648333Z * [new branch] gh/eellison/880/orig -> origin/gh/eellison/880/orig 2025-12-04T08:54:02.2648408Z * [new branch] gh/eellison/881/base -> origin/gh/eellison/881/base 2025-12-04T08:54:02.2648477Z * [new branch] gh/eellison/881/head -> origin/gh/eellison/881/head 2025-12-04T08:54:02.2648546Z * [new branch] gh/eellison/881/orig -> origin/gh/eellison/881/orig 2025-12-04T08:54:02.2648621Z * [new branch] gh/eellison/882/base -> origin/gh/eellison/882/base 
2025-12-04T08:54:02.2648692Z * [new branch] gh/eellison/882/head -> origin/gh/eellison/882/head 2025-12-04T08:54:02.2648761Z * [new branch] gh/eellison/882/orig -> origin/gh/eellison/882/orig 2025-12-04T08:54:02.2648835Z * [new branch] gh/eellison/883/base -> origin/gh/eellison/883/base 2025-12-04T08:54:02.2648904Z * [new branch] gh/eellison/883/head -> origin/gh/eellison/883/head 2025-12-04T08:54:02.2648973Z * [new branch] gh/eellison/883/orig -> origin/gh/eellison/883/orig 2025-12-04T08:54:02.2649408Z * [new branch] gh/eellison/884/base -> origin/gh/eellison/884/base 2025-12-04T08:54:02.2649477Z * [new branch] gh/eellison/884/head -> origin/gh/eellison/884/head 2025-12-04T08:54:02.2649547Z * [new branch] gh/eellison/884/orig -> origin/gh/eellison/884/orig 2025-12-04T08:54:02.2649620Z * [new branch] gh/etaf/147/base -> origin/gh/etaf/147/base 2025-12-04T08:54:02.2649689Z * [new branch] gh/etaf/147/head -> origin/gh/etaf/147/head 2025-12-04T08:54:02.2649753Z * [new branch] gh/etaf/154/base -> origin/gh/etaf/154/base 2025-12-04T08:54:02.2649823Z * [new branch] gh/etaf/154/head -> origin/gh/etaf/154/head 2025-12-04T08:54:02.2649888Z * [new branch] gh/etaf/154/orig -> origin/gh/etaf/154/orig 2025-12-04T08:54:02.2649951Z * [new branch] gh/etaf/156/base -> origin/gh/etaf/156/base 2025-12-04T08:54:02.2650017Z * [new branch] gh/etaf/156/head -> origin/gh/etaf/156/head 2025-12-04T08:54:02.2650081Z * [new branch] gh/etaf/156/orig -> origin/gh/etaf/156/orig 2025-12-04T08:54:02.2650149Z * [new branch] gh/etaf/157/base -> origin/gh/etaf/157/base 2025-12-04T08:54:02.2650214Z * [new branch] gh/etaf/157/head -> origin/gh/etaf/157/head 2025-12-04T08:54:02.2650277Z * [new branch] gh/etaf/157/orig -> origin/gh/etaf/157/orig 2025-12-04T08:54:02.2650342Z * [new branch] gh/etaf/158/base -> origin/gh/etaf/158/base 2025-12-04T08:54:02.2650409Z * [new branch] gh/etaf/158/head -> origin/gh/etaf/158/head 2025-12-04T08:54:02.2650472Z * [new branch] gh/etaf/158/orig -> origin/gh/etaf/158/orig 
2025-12-04T08:54:02.2650539Z * [new branch] gh/etaf/159/base -> origin/gh/etaf/159/base 2025-12-04T08:54:02.2650603Z * [new branch] gh/etaf/159/head -> origin/gh/etaf/159/head 2025-12-04T08:54:02.2650669Z * [new branch] gh/etaf/159/orig -> origin/gh/etaf/159/orig 2025-12-04T08:54:02.2650738Z * [new branch] gh/etaf/160/base -> origin/gh/etaf/160/base 2025-12-04T08:54:02.2650803Z * [new branch] gh/etaf/160/head -> origin/gh/etaf/160/head 2025-12-04T08:54:02.2650868Z * [new branch] gh/etaf/160/orig -> origin/gh/etaf/160/orig 2025-12-04T08:54:02.2650933Z * [new branch] gh/etaf/161/base -> origin/gh/etaf/161/base 2025-12-04T08:54:02.2651025Z * [new branch] gh/etaf/161/head -> origin/gh/etaf/161/head 2025-12-04T08:54:02.2651088Z * [new branch] gh/etaf/161/orig -> origin/gh/etaf/161/orig 2025-12-04T08:54:02.2651153Z * [new branch] gh/etaf/166/base -> origin/gh/etaf/166/base 2025-12-04T08:54:02.2651217Z * [new branch] gh/etaf/166/head -> origin/gh/etaf/166/head 2025-12-04T08:54:02.2651283Z * [new branch] gh/etaf/166/orig -> origin/gh/etaf/166/orig 2025-12-04T08:54:02.2651352Z * [new branch] gh/etaf/167/base -> origin/gh/etaf/167/base 2025-12-04T08:54:02.2651416Z * [new branch] gh/etaf/167/head -> origin/gh/etaf/167/head 2025-12-04T08:54:02.2651480Z * [new branch] gh/etaf/167/orig -> origin/gh/etaf/167/orig 2025-12-04T08:54:02.2651545Z * [new branch] gh/etaf/168/base -> origin/gh/etaf/168/base 2025-12-04T08:54:02.2651607Z * [new branch] gh/etaf/168/head -> origin/gh/etaf/168/head 2025-12-04T08:54:02.2651672Z * [new branch] gh/etaf/168/orig -> origin/gh/etaf/168/orig 2025-12-04T08:54:02.2651737Z * [new branch] gh/etaf/172/base -> origin/gh/etaf/172/base 2025-12-04T08:54:02.2651799Z * [new branch] gh/etaf/172/head -> origin/gh/etaf/172/head 2025-12-04T08:54:02.2651907Z * [new branch] gh/etaf/172/orig -> origin/gh/etaf/172/orig 2025-12-04T08:54:02.2651972Z * [new branch] gh/etaf/173/base -> origin/gh/etaf/173/base 2025-12-04T08:54:02.2652036Z * [new branch] gh/etaf/173/head -> 
origin/gh/etaf/173/head 2025-12-04T08:54:02.2652103Z * [new branch] gh/etaf/173/orig -> origin/gh/etaf/173/orig 2025-12-04T08:54:02.2652167Z * [new branch] gh/etaf/174/base -> origin/gh/etaf/174/base 2025-12-04T08:54:02.2652230Z * [new branch] gh/etaf/174/head -> origin/gh/etaf/174/head 2025-12-04T08:54:02.2652297Z * [new branch] gh/etaf/175/base -> origin/gh/etaf/175/base 2025-12-04T08:54:02.2652360Z * [new branch] gh/etaf/175/head -> origin/gh/etaf/175/head 2025-12-04T08:54:02.2652424Z * [new branch] gh/etaf/175/orig -> origin/gh/etaf/175/orig 2025-12-04T08:54:02.2652494Z * [new branch] gh/etaf/176/base -> origin/gh/etaf/176/base 2025-12-04T08:54:02.2652559Z * [new branch] gh/etaf/176/head -> origin/gh/etaf/176/head 2025-12-04T08:54:02.2652623Z * [new branch] gh/etaf/176/orig -> origin/gh/etaf/176/orig 2025-12-04T08:54:02.2652698Z * [new branch] gh/etaf/177/base -> origin/gh/etaf/177/base 2025-12-04T08:54:02.2652761Z * [new branch] gh/etaf/177/head -> origin/gh/etaf/177/head 2025-12-04T08:54:02.2652824Z * [new branch] gh/etaf/177/orig -> origin/gh/etaf/177/orig 2025-12-04T08:54:02.2652892Z * [new branch] gh/etaf/178/base -> origin/gh/etaf/178/base 2025-12-04T08:54:02.2652957Z * [new branch] gh/etaf/178/head -> origin/gh/etaf/178/head 2025-12-04T08:54:02.2653025Z * [new branch] gh/etaf/178/orig -> origin/gh/etaf/178/orig 2025-12-04T08:54:02.2653093Z * [new branch] gh/etaf/179/base -> origin/gh/etaf/179/base 2025-12-04T08:54:02.2653158Z * [new branch] gh/etaf/179/head -> origin/gh/etaf/179/head 2025-12-04T08:54:02.2653223Z * [new branch] gh/etaf/179/orig -> origin/gh/etaf/179/orig 2025-12-04T08:54:02.2653290Z * [new branch] gh/etaf/180/base -> origin/gh/etaf/180/base 2025-12-04T08:54:02.2653353Z * [new branch] gh/etaf/180/head -> origin/gh/etaf/180/head 2025-12-04T08:54:02.2653417Z * [new branch] gh/etaf/180/orig -> origin/gh/etaf/180/orig 2025-12-04T08:54:02.2653529Z * [new branch] gh/exclamaforte/1/base -> origin/gh/exclamaforte/1/base 
2025-12-04T08:54:02.2653609Z * [new branch] gh/exclamaforte/1/head -> origin/gh/exclamaforte/1/head 2025-12-04T08:54:02.2653686Z * [new branch] gh/exclamaforte/2/base -> origin/gh/exclamaforte/2/base 2025-12-04T08:54:02.2653760Z * [new branch] gh/exclamaforte/2/head -> origin/gh/exclamaforte/2/head 2025-12-04T08:54:02.2653835Z * [new branch] gh/exclamaforte/3/base -> origin/gh/exclamaforte/3/base 2025-12-04T08:54:02.2653910Z * [new branch] gh/exclamaforte/3/head -> origin/gh/exclamaforte/3/head 2025-12-04T08:54:02.2653986Z * [new branch] gh/exclamaforte/4/base -> origin/gh/exclamaforte/4/base 2025-12-04T08:54:02.2654062Z * [new branch] gh/exclamaforte/4/head -> origin/gh/exclamaforte/4/head 2025-12-04T08:54:02.2654136Z * [new branch] gh/ezyang/2374/base -> origin/gh/ezyang/2374/base 2025-12-04T08:54:02.2654207Z * [new branch] gh/ezyang/2374/head -> origin/gh/ezyang/2374/head 2025-12-04T08:54:02.2654275Z * [new branch] gh/ezyang/2374/orig -> origin/gh/ezyang/2374/orig 2025-12-04T08:54:02.2654347Z * [new branch] gh/ezyang/2973/base -> origin/gh/ezyang/2973/base 2025-12-04T08:54:02.2654447Z * [new branch] gh/ezyang/2973/head -> origin/gh/ezyang/2973/head 2025-12-04T08:54:02.2654515Z * [new branch] gh/ezyang/2973/orig -> origin/gh/ezyang/2973/orig 2025-12-04T08:54:02.2654585Z * [new branch] gh/ezyang/2974/base -> origin/gh/ezyang/2974/base 2025-12-04T08:54:02.2654654Z * [new branch] gh/ezyang/2974/head -> origin/gh/ezyang/2974/head 2025-12-04T08:54:02.2654723Z * [new branch] gh/ezyang/2974/orig -> origin/gh/ezyang/2974/orig 2025-12-04T08:54:02.2654793Z * [new branch] gh/ezyang/3131/base -> origin/gh/ezyang/3131/base 2025-12-04T08:54:02.2654862Z * [new branch] gh/ezyang/3131/head -> origin/gh/ezyang/3131/head 2025-12-04T08:54:02.2654929Z * [new branch] gh/ezyang/3131/orig -> origin/gh/ezyang/3131/orig 2025-12-04T08:54:02.2654998Z * [new branch] gh/ezyang/3139/base -> origin/gh/ezyang/3139/base 2025-12-04T08:54:02.2655068Z * [new branch] gh/ezyang/3139/head -> 
origin/gh/ezyang/3139/head 2025-12-04T08:54:02.2655141Z * [new branch] gh/ezyang/3139/orig -> origin/gh/ezyang/3139/orig 2025-12-04T08:54:02.2655210Z * [new branch] gh/ezyang/3140/base -> origin/gh/ezyang/3140/base 2025-12-04T08:54:02.2655277Z * [new branch] gh/ezyang/3140/head -> origin/gh/ezyang/3140/head 2025-12-04T08:54:02.2655346Z * [new branch] gh/ezyang/3140/orig -> origin/gh/ezyang/3140/orig 2025-12-04T08:54:02.2655413Z * [new branch] gh/ezyang/3143/base -> origin/gh/ezyang/3143/base 2025-12-04T08:54:02.2655482Z * [new branch] gh/ezyang/3143/head -> origin/gh/ezyang/3143/head 2025-12-04T08:54:02.2655552Z * [new branch] gh/ezyang/3143/orig -> origin/gh/ezyang/3143/orig 2025-12-04T08:54:02.2655623Z * [new branch] gh/ezyang/3144/base -> origin/gh/ezyang/3144/base 2025-12-04T08:54:02.2655693Z * [new branch] gh/ezyang/3144/head -> origin/gh/ezyang/3144/head 2025-12-04T08:54:02.2655762Z * [new branch] gh/ezyang/3144/orig -> origin/gh/ezyang/3144/orig 2025-12-04T08:54:02.2655830Z * [new branch] gh/ezyang/3167/base -> origin/gh/ezyang/3167/base 2025-12-04T08:54:02.2655898Z * [new branch] gh/ezyang/3167/head -> origin/gh/ezyang/3167/head 2025-12-04T08:54:02.2656013Z * [new branch] gh/ezyang/3167/orig -> origin/gh/ezyang/3167/orig 2025-12-04T08:54:02.2656082Z * [new branch] gh/ezyang/3173/base -> origin/gh/ezyang/3173/base 2025-12-04T08:54:02.2656192Z * [new branch] gh/ezyang/3173/head -> origin/gh/ezyang/3173/head 2025-12-04T08:54:02.2656264Z * [new branch] gh/ezyang/3173/orig -> origin/gh/ezyang/3173/orig 2025-12-04T08:54:02.2656333Z * [new branch] gh/ezyang/3175/base -> origin/gh/ezyang/3175/base 2025-12-04T08:54:02.2656402Z * [new branch] gh/ezyang/3175/head -> origin/gh/ezyang/3175/head 2025-12-04T08:54:02.2656472Z * [new branch] gh/ezyang/3175/orig -> origin/gh/ezyang/3175/orig 2025-12-04T08:54:02.2656539Z * [new branch] gh/ezyang/3182/base -> origin/gh/ezyang/3182/base 2025-12-04T08:54:02.2656606Z * [new branch] gh/ezyang/3182/head -> 
origin/gh/ezyang/3182/head 2025-12-04T08:54:02.2656676Z * [new branch] gh/ezyang/3182/orig -> origin/gh/ezyang/3182/orig 2025-12-04T08:54:02.2656746Z * [new branch] gh/ezyang/3185/base -> origin/gh/ezyang/3185/base 2025-12-04T08:54:02.2656818Z * [new branch] gh/ezyang/3185/head -> origin/gh/ezyang/3185/head 2025-12-04T08:54:02.2656885Z * [new branch] gh/ezyang/3185/orig -> origin/gh/ezyang/3185/orig 2025-12-04T08:54:02.2656953Z * [new branch] gh/ezyang/3189/base -> origin/gh/ezyang/3189/base 2025-12-04T08:54:02.2657060Z * [new branch] gh/ezyang/3189/head -> origin/gh/ezyang/3189/head 2025-12-04T08:54:02.2657128Z * [new branch] gh/ezyang/3189/orig -> origin/gh/ezyang/3189/orig 2025-12-04T08:54:02.2657196Z * [new branch] gh/ezyang/3191/base -> origin/gh/ezyang/3191/base 2025-12-04T08:54:02.2657268Z * [new branch] gh/ezyang/3191/head -> origin/gh/ezyang/3191/head 2025-12-04T08:54:02.2657336Z * [new branch] gh/ezyang/3191/orig -> origin/gh/ezyang/3191/orig 2025-12-04T08:54:02.2657403Z * [new branch] gh/ezyang/3192/base -> origin/gh/ezyang/3192/base 2025-12-04T08:54:02.2657474Z * [new branch] gh/ezyang/3192/head -> origin/gh/ezyang/3192/head 2025-12-04T08:54:02.2657541Z * [new branch] gh/ezyang/3192/orig -> origin/gh/ezyang/3192/orig 2025-12-04T08:54:02.2657609Z * [new branch] gh/ezyang/3193/base -> origin/gh/ezyang/3193/base 2025-12-04T08:54:02.2657679Z * [new branch] gh/ezyang/3193/head -> origin/gh/ezyang/3193/head 2025-12-04T08:54:02.2657746Z * [new branch] gh/ezyang/3193/orig -> origin/gh/ezyang/3193/orig 2025-12-04T08:54:02.2657815Z * [new branch] gh/ezyang/3194/base -> origin/gh/ezyang/3194/base 2025-12-04T08:54:02.2657888Z * [new branch] gh/ezyang/3194/head -> origin/gh/ezyang/3194/head 2025-12-04T08:54:02.2657955Z * [new branch] gh/ezyang/3194/orig -> origin/gh/ezyang/3194/orig 2025-12-04T08:54:02.2658023Z * [new branch] gh/ezyang/3195/base -> origin/gh/ezyang/3195/base 2025-12-04T08:54:02.2658094Z * [new branch] gh/ezyang/3195/head -> 
origin/gh/ezyang/3195/head 2025-12-04T08:54:02.2658161Z * [new branch] gh/ezyang/3195/orig -> origin/gh/ezyang/3195/orig 2025-12-04T08:54:02.2658229Z * [new branch] gh/ezyang/3196/base -> origin/gh/ezyang/3196/base 2025-12-04T08:54:02.2658300Z * [new branch] gh/ezyang/3196/head -> origin/gh/ezyang/3196/head 2025-12-04T08:54:02.2658368Z * [new branch] gh/ezyang/3196/orig -> origin/gh/ezyang/3196/orig 2025-12-04T08:54:02.2658440Z * [new branch] gh/ezyang/3197/base -> origin/gh/ezyang/3197/base 2025-12-04T08:54:02.2658506Z * [new branch] gh/ezyang/3197/head -> origin/gh/ezyang/3197/head 2025-12-04T08:54:02.2658573Z * [new branch] gh/ezyang/3197/orig -> origin/gh/ezyang/3197/orig 2025-12-04T08:54:02.2658643Z * [new branch] gh/ezyang/3198/base -> origin/gh/ezyang/3198/base 2025-12-04T08:54:02.2658744Z * [new branch] gh/ezyang/3198/head -> origin/gh/ezyang/3198/head 2025-12-04T08:54:02.2658811Z * [new branch] gh/ezyang/3198/orig -> origin/gh/ezyang/3198/orig 2025-12-04T08:54:02.2658880Z * [new branch] gh/ezyang/3199/base -> origin/gh/ezyang/3199/base 2025-12-04T08:54:02.2658949Z * [new branch] gh/ezyang/3199/head -> origin/gh/ezyang/3199/head 2025-12-04T08:54:02.2659016Z * [new branch] gh/ezyang/3199/orig -> origin/gh/ezyang/3199/orig 2025-12-04T08:54:02.2659090Z * [new branch] gh/ezyang/3200/base -> origin/gh/ezyang/3200/base 2025-12-04T08:54:02.2659159Z * [new branch] gh/ezyang/3200/head -> origin/gh/ezyang/3200/head 2025-12-04T08:54:02.2659226Z * [new branch] gh/ezyang/3200/orig -> origin/gh/ezyang/3200/orig 2025-12-04T08:54:02.2659294Z * [new branch] gh/ezyang/3201/base -> origin/gh/ezyang/3201/base 2025-12-04T08:54:02.2659362Z * [new branch] gh/ezyang/3201/head -> origin/gh/ezyang/3201/head 2025-12-04T08:54:02.2659429Z * [new branch] gh/ezyang/3201/orig -> origin/gh/ezyang/3201/orig 2025-12-04T08:54:02.2659498Z * [new branch] gh/ezyang/3202/base -> origin/gh/ezyang/3202/base 2025-12-04T08:54:02.2659597Z * [new branch] gh/ezyang/3202/head -> 
origin/gh/ezyang/3202/head 2025-12-04T08:54:02.2659667Z * [new branch] gh/ezyang/3202/orig -> origin/gh/ezyang/3202/orig 2025-12-04T08:54:02.2659735Z * [new branch] gh/ezyang/3203/base -> origin/gh/ezyang/3203/base 2025-12-04T08:54:02.2659802Z * [new branch] gh/ezyang/3203/head -> origin/gh/ezyang/3203/head 2025-12-04T08:54:02.2659869Z * [new branch] gh/ezyang/3203/orig -> origin/gh/ezyang/3203/orig 2025-12-04T08:54:02.2659938Z * [new branch] gh/ezyang/3204/base -> origin/gh/ezyang/3204/base 2025-12-04T08:54:02.2660010Z * [new branch] gh/ezyang/3204/head -> origin/gh/ezyang/3204/head 2025-12-04T08:54:02.2678686Z * [new branch] gh/ezyang/3204/orig -> origin/gh/ezyang/3204/orig 2025-12-04T08:54:02.2678787Z * [new branch] gh/ezyang/3205/base -> origin/gh/ezyang/3205/base 2025-12-04T08:54:02.2678864Z * [new branch] gh/ezyang/3205/head -> origin/gh/ezyang/3205/head 2025-12-04T08:54:02.2678933Z * [new branch] gh/ezyang/3205/orig -> origin/gh/ezyang/3205/orig 2025-12-04T08:54:02.2679004Z * [new branch] gh/ezyang/3206/base -> origin/gh/ezyang/3206/base 2025-12-04T08:54:02.2679071Z * [new branch] gh/ezyang/3206/head -> origin/gh/ezyang/3206/head 2025-12-04T08:54:02.2679140Z * [new branch] gh/ezyang/3206/orig -> origin/gh/ezyang/3206/orig 2025-12-04T08:54:02.2679207Z * [new branch] gh/ezyang/3207/base -> origin/gh/ezyang/3207/base 2025-12-04T08:54:02.2679277Z * [new branch] gh/ezyang/3207/head -> origin/gh/ezyang/3207/head 2025-12-04T08:54:02.2679347Z * [new branch] gh/ezyang/3207/orig -> origin/gh/ezyang/3207/orig 2025-12-04T08:54:02.2679414Z * [new branch] gh/ezyang/3208/base -> origin/gh/ezyang/3208/base 2025-12-04T08:54:02.2679483Z * [new branch] gh/ezyang/3208/head -> origin/gh/ezyang/3208/head 2025-12-04T08:54:02.2679551Z * [new branch] gh/ezyang/3208/orig -> origin/gh/ezyang/3208/orig 2025-12-04T08:54:02.2679620Z * [new branch] gh/ezyang/3209/base -> origin/gh/ezyang/3209/base 2025-12-04T08:54:02.2679687Z * [new branch] gh/ezyang/3209/head -> 
origin/gh/ezyang/3209/head 2025-12-04T08:54:02.2679757Z * [new branch] gh/ezyang/3209/orig -> origin/gh/ezyang/3209/orig 2025-12-04T08:54:02.2679832Z * [new branch] gh/fadara01/3/base -> origin/gh/fadara01/3/base 2025-12-04T08:54:02.2679975Z * [new branch] gh/fadara01/3/head -> origin/gh/fadara01/3/head 2025-12-04T08:54:02.2680044Z * [new branch] gh/fadara01/3/orig -> origin/gh/fadara01/3/orig 2025-12-04T08:54:02.2680113Z * [new branch] gh/fadara01/5/base -> origin/gh/fadara01/5/base 2025-12-04T08:54:02.2680181Z * [new branch] gh/fadara01/5/head -> origin/gh/fadara01/5/head 2025-12-04T08:54:02.2680251Z * [new branch] gh/fadara01/5/orig -> origin/gh/fadara01/5/orig 2025-12-04T08:54:02.2680319Z * [new branch] gh/fadara01/6/base -> origin/gh/fadara01/6/base 2025-12-04T08:54:02.2680386Z * [new branch] gh/fadara01/6/head -> origin/gh/fadara01/6/head 2025-12-04T08:54:02.2680455Z * [new branch] gh/fadara01/6/orig -> origin/gh/fadara01/6/orig 2025-12-04T08:54:02.2680521Z * [new branch] gh/fadara01/7/base -> origin/gh/fadara01/7/base 2025-12-04T08:54:02.2680591Z * [new branch] gh/fadara01/7/head -> origin/gh/fadara01/7/head 2025-12-04T08:54:02.2680659Z * [new branch] gh/fadara01/7/orig -> origin/gh/fadara01/7/orig 2025-12-04T08:54:02.2680725Z * [new branch] gh/fadara01/8/base -> origin/gh/fadara01/8/base 2025-12-04T08:54:02.2680840Z * [new branch] gh/fadara01/8/head -> origin/gh/fadara01/8/head 2025-12-04T08:54:02.2680909Z * [new branch] gh/fadara01/8/orig -> origin/gh/fadara01/8/orig 2025-12-04T08:54:02.2680978Z * [new branch] gh/fadara01/9/base -> origin/gh/fadara01/9/base 2025-12-04T08:54:02.2681047Z * [new branch] gh/fadara01/9/head -> origin/gh/fadara01/9/head 2025-12-04T08:54:02.2681114Z * [new branch] gh/fadara01/9/orig -> origin/gh/fadara01/9/orig 2025-12-04T08:54:02.2681181Z * [new branch] gh/fduwjj/182/base -> origin/gh/fduwjj/182/base 2025-12-04T08:54:02.2681252Z * [new branch] gh/fduwjj/182/head -> origin/gh/fduwjj/182/head 2025-12-04T08:54:02.2681319Z * [new 
branch] gh/fduwjj/182/orig -> origin/gh/fduwjj/182/orig 2025-12-04T08:54:02.2681386Z * [new branch] gh/fduwjj/211/base -> origin/gh/fduwjj/211/base 2025-12-04T08:54:02.2681455Z * [new branch] gh/fduwjj/211/head -> origin/gh/fduwjj/211/head 2025-12-04T08:54:02.2681521Z * [new branch] gh/fduwjj/211/orig -> origin/gh/fduwjj/211/orig 2025-12-04T08:54:02.2681587Z * [new branch] gh/fduwjj/212/base -> origin/gh/fduwjj/212/base 2025-12-04T08:54:02.2681655Z * [new branch] gh/fduwjj/212/head -> origin/gh/fduwjj/212/head 2025-12-04T08:54:02.2681721Z * [new branch] gh/fduwjj/212/orig -> origin/gh/fduwjj/212/orig 2025-12-04T08:54:02.2681787Z * [new branch] gh/fduwjj/213/base -> origin/gh/fduwjj/213/base 2025-12-04T08:54:02.2681858Z * [new branch] gh/fduwjj/213/head -> origin/gh/fduwjj/213/head 2025-12-04T08:54:02.2681926Z * [new branch] gh/fduwjj/213/orig -> origin/gh/fduwjj/213/orig 2025-12-04T08:54:02.2681991Z * [new branch] gh/fduwjj/226/base -> origin/gh/fduwjj/226/base 2025-12-04T08:54:02.2682060Z * [new branch] gh/fduwjj/226/head -> origin/gh/fduwjj/226/head 2025-12-04T08:54:02.2682128Z * [new branch] gh/fduwjj/226/orig -> origin/gh/fduwjj/226/orig 2025-12-04T08:54:02.2682195Z * [new branch] gh/fduwjj/229/base -> origin/gh/fduwjj/229/base 2025-12-04T08:54:02.2682264Z * [new branch] gh/fduwjj/229/head -> origin/gh/fduwjj/229/head 2025-12-04T08:54:02.2682331Z * [new branch] gh/fduwjj/229/orig -> origin/gh/fduwjj/229/orig 2025-12-04T08:54:02.2682400Z * [new branch] gh/fduwjj/233/base -> origin/gh/fduwjj/233/base 2025-12-04T08:54:02.2682498Z * [new branch] gh/fduwjj/233/head -> origin/gh/fduwjj/233/head 2025-12-04T08:54:02.2682565Z * [new branch] gh/fduwjj/233/orig -> origin/gh/fduwjj/233/orig 2025-12-04T08:54:02.2682633Z * [new branch] gh/fduwjj/234/base -> origin/gh/fduwjj/234/base 2025-12-04T08:54:02.2682699Z * [new branch] gh/fduwjj/234/head -> origin/gh/fduwjj/234/head 2025-12-04T08:54:02.2682769Z * [new branch] gh/fduwjj/234/orig -> origin/gh/fduwjj/234/orig 
2025-12-04T08:54:02.2682837Z * [new branch] gh/fduwjj/235/base -> origin/gh/fduwjj/235/base 2025-12-04T08:54:02.2682904Z * [new branch] gh/fduwjj/235/head -> origin/gh/fduwjj/235/head 2025-12-04T08:54:02.2682969Z * [new branch] gh/fduwjj/235/orig -> origin/gh/fduwjj/235/orig 2025-12-04T08:54:02.2683037Z * [new branch] gh/fduwjj/236/base -> origin/gh/fduwjj/236/base 2025-12-04T08:54:02.2683106Z * [new branch] gh/fduwjj/236/head -> origin/gh/fduwjj/236/head 2025-12-04T08:54:02.2683173Z * [new branch] gh/fduwjj/236/orig -> origin/gh/fduwjj/236/orig 2025-12-04T08:54:02.2683240Z * [new branch] gh/fduwjj/237/base -> origin/gh/fduwjj/237/base 2025-12-04T08:54:02.2683308Z * [new branch] gh/fduwjj/237/head -> origin/gh/fduwjj/237/head 2025-12-04T08:54:02.2683410Z * [new branch] gh/fduwjj/237/orig -> origin/gh/fduwjj/237/orig 2025-12-04T08:54:02.2683482Z * [new branch] gh/fduwjj/238/base -> origin/gh/fduwjj/238/base 2025-12-04T08:54:02.2683548Z * [new branch] gh/fduwjj/238/head -> origin/gh/fduwjj/238/head 2025-12-04T08:54:02.2683615Z * [new branch] gh/fduwjj/238/orig -> origin/gh/fduwjj/238/orig 2025-12-04T08:54:02.2683682Z * [new branch] gh/fduwjj/239/base -> origin/gh/fduwjj/239/base 2025-12-04T08:54:02.2683755Z * [new branch] gh/fduwjj/239/head -> origin/gh/fduwjj/239/head 2025-12-04T08:54:02.2683822Z * [new branch] gh/fduwjj/239/orig -> origin/gh/fduwjj/239/orig 2025-12-04T08:54:02.2683893Z * [new branch] gh/fegin/332/base -> origin/gh/fegin/332/base 2025-12-04T08:54:02.2683960Z * [new branch] gh/fegin/332/head -> origin/gh/fegin/332/head 2025-12-04T08:54:02.2684027Z * [new branch] gh/fegin/332/orig -> origin/gh/fegin/332/orig 2025-12-04T08:54:02.2684095Z * [new branch] gh/fegin/333/base -> origin/gh/fegin/333/base 2025-12-04T08:54:02.2684160Z * [new branch] gh/fegin/333/head -> origin/gh/fegin/333/head 2025-12-04T08:54:02.2684224Z * [new branch] gh/fegin/333/orig -> origin/gh/fegin/333/orig 2025-12-04T08:54:02.2684290Z * [new branch] gh/fegin/334/base -> 
origin/gh/fegin/334/base 2025-12-04T08:54:02.2684356Z * [new branch] gh/fegin/334/head -> origin/gh/fegin/334/head 2025-12-04T08:54:02.2684421Z * [new branch] gh/fegin/334/orig -> origin/gh/fegin/334/orig 2025-12-04T08:54:02.2684487Z * [new branch] gh/fegin/335/base -> origin/gh/fegin/335/base 2025-12-04T08:54:02.2684550Z * [new branch] gh/fegin/335/head -> origin/gh/fegin/335/head 2025-12-04T08:54:02.2684615Z * [new branch] gh/fegin/335/orig -> origin/gh/fegin/335/orig 2025-12-04T08:54:02.2684684Z * [new branch] gh/fffrog/160/base -> origin/gh/fffrog/160/base 2025-12-04T08:54:02.2684750Z * [new branch] gh/fffrog/160/head -> origin/gh/fffrog/160/head 2025-12-04T08:54:02.2684817Z * [new branch] gh/fffrog/177/base -> origin/gh/fffrog/177/base 2025-12-04T08:54:02.2684887Z * [new branch] gh/fffrog/177/head -> origin/gh/fffrog/177/head 2025-12-04T08:54:02.2684980Z * [new branch] gh/fffrog/177/orig -> origin/gh/fffrog/177/orig 2025-12-04T08:54:02.2685046Z * [new branch] gh/fffrog/178/base -> origin/gh/fffrog/178/base 2025-12-04T08:54:02.2685113Z * [new branch] gh/fffrog/178/head -> origin/gh/fffrog/178/head 2025-12-04T08:54:02.2685179Z * [new branch] gh/fffrog/178/orig -> origin/gh/fffrog/178/orig 2025-12-04T08:54:02.2685248Z * [new branch] gh/fffrog/181/base -> origin/gh/fffrog/181/base 2025-12-04T08:54:02.2685315Z * [new branch] gh/fffrog/181/head -> origin/gh/fffrog/181/head 2025-12-04T08:54:02.2685382Z * [new branch] gh/fffrog/181/orig -> origin/gh/fffrog/181/orig 2025-12-04T08:54:02.2685447Z * [new branch] gh/fffrog/183/base -> origin/gh/fffrog/183/base 2025-12-04T08:54:02.2685516Z * [new branch] gh/fffrog/183/head -> origin/gh/fffrog/183/head 2025-12-04T08:54:02.2685583Z * [new branch] gh/fffrog/183/orig -> origin/gh/fffrog/183/orig 2025-12-04T08:54:02.2685652Z * [new branch] gh/fxdawnn/10/base -> origin/gh/fxdawnn/10/base 2025-12-04T08:54:02.2685718Z * [new branch] gh/fxdawnn/10/head -> origin/gh/fxdawnn/10/head 2025-12-04T08:54:02.2685785Z * [new branch] 
gh/fxdawnn/10/orig -> origin/gh/fxdawnn/10/orig 2025-12-04T08:54:02.2685887Z * [new branch] gh/fxdawnn/11/base -> origin/gh/fxdawnn/11/base 2025-12-04T08:54:02.2685987Z * [new branch] gh/fxdawnn/11/head -> origin/gh/fxdawnn/11/head 2025-12-04T08:54:02.2686054Z * [new branch] gh/fxdawnn/11/orig -> origin/gh/fxdawnn/11/orig 2025-12-04T08:54:02.2686122Z * [new branch] gh/fxdawnn/12/base -> origin/gh/fxdawnn/12/base 2025-12-04T08:54:02.2686188Z * [new branch] gh/fxdawnn/12/head -> origin/gh/fxdawnn/12/head 2025-12-04T08:54:02.2686255Z * [new branch] gh/fxdawnn/12/orig -> origin/gh/fxdawnn/12/orig 2025-12-04T08:54:02.2686326Z * [new branch] gh/fxdawnn/13/base -> origin/gh/fxdawnn/13/base 2025-12-04T08:54:02.2686392Z * [new branch] gh/fxdawnn/13/head -> origin/gh/fxdawnn/13/head 2025-12-04T08:54:02.2686458Z * [new branch] gh/fxdawnn/13/orig -> origin/gh/fxdawnn/13/orig 2025-12-04T08:54:02.2686527Z * [new branch] gh/fxdawnn/14/base -> origin/gh/fxdawnn/14/base 2025-12-04T08:54:02.2686594Z * [new branch] gh/fxdawnn/14/head -> origin/gh/fxdawnn/14/head 2025-12-04T08:54:02.2686661Z * [new branch] gh/fxdawnn/14/orig -> origin/gh/fxdawnn/14/orig 2025-12-04T08:54:02.2686729Z * [new branch] gh/fxdawnn/15/base -> origin/gh/fxdawnn/15/base 2025-12-04T08:54:02.2686796Z * [new branch] gh/fxdawnn/15/head -> origin/gh/fxdawnn/15/head 2025-12-04T08:54:02.2686863Z * [new branch] gh/fxdawnn/15/orig -> origin/gh/fxdawnn/15/orig 2025-12-04T08:54:02.2686934Z * [new branch] gh/fxdawnn/6/base -> origin/gh/fxdawnn/6/base 2025-12-04T08:54:02.2687001Z * [new branch] gh/fxdawnn/6/head -> origin/gh/fxdawnn/6/head 2025-12-04T08:54:02.2687066Z * [new branch] gh/fxdawnn/6/orig -> origin/gh/fxdawnn/6/orig 2025-12-04T08:54:02.2687135Z * [new branch] gh/fxdawnn/7/base -> origin/gh/fxdawnn/7/base 2025-12-04T08:54:02.2687201Z * [new branch] gh/fxdawnn/7/head -> origin/gh/fxdawnn/7/head 2025-12-04T08:54:02.2687271Z * [new branch] gh/fxdawnn/7/orig -> origin/gh/fxdawnn/7/orig 2025-12-04T08:54:02.2687336Z * 
[new branch] gh/fxdawnn/9/base -> origin/gh/fxdawnn/9/base 2025-12-04T08:54:02.2687402Z * [new branch] gh/fxdawnn/9/head -> origin/gh/fxdawnn/9/head 2025-12-04T08:54:02.2687474Z * [new branch] gh/fxdawnn/9/orig -> origin/gh/fxdawnn/9/orig 2025-12-04T08:54:02.2687606Z * [new branch] gh/galv/1/base -> origin/gh/galv/1/base 2025-12-04T08:54:02.2687671Z * [new branch] gh/galv/1/head -> origin/gh/galv/1/head 2025-12-04T08:54:02.2687735Z * [new branch] gh/galv/1/orig -> origin/gh/galv/1/orig 2025-12-04T08:54:02.2687799Z * [new branch] gh/galv/2/base -> origin/gh/galv/2/base 2025-12-04T08:54:02.2687860Z * [new branch] gh/galv/2/head -> origin/gh/galv/2/head 2025-12-04T08:54:02.2687923Z * [new branch] gh/galv/2/orig -> origin/gh/galv/2/orig 2025-12-04T08:54:02.2687984Z * [new branch] gh/galv/3/base -> origin/gh/galv/3/base 2025-12-04T08:54:02.2688046Z * [new branch] gh/galv/3/head -> origin/gh/galv/3/head 2025-12-04T08:54:02.2688112Z * [new branch] gh/galv/3/orig -> origin/gh/galv/3/orig 2025-12-04T08:54:02.2688189Z * [new branch] gh/guangyey/134/base -> origin/gh/guangyey/134/base 2025-12-04T08:54:02.2688263Z * [new branch] gh/guangyey/134/head -> origin/gh/guangyey/134/head 2025-12-04T08:54:02.2688336Z * [new branch] gh/guangyey/134/orig -> origin/gh/guangyey/134/orig 2025-12-04T08:54:02.2688406Z * [new branch] gh/guangyey/163/base -> origin/gh/guangyey/163/base 2025-12-04T08:54:02.2688516Z * [new branch] gh/guangyey/163/head -> origin/gh/guangyey/163/head 2025-12-04T08:54:02.2688588Z * [new branch] gh/guangyey/163/orig -> origin/gh/guangyey/163/orig 2025-12-04T08:54:02.2688657Z * [new branch] gh/guangyey/168/base -> origin/gh/guangyey/168/base 2025-12-04T08:54:02.2688726Z * [new branch] gh/guangyey/168/head -> origin/gh/guangyey/168/head 2025-12-04T08:54:02.2688796Z * [new branch] gh/guangyey/168/orig -> origin/gh/guangyey/168/orig 2025-12-04T08:54:02.2688866Z * [new branch] gh/guangyey/169/base -> origin/gh/guangyey/169/base 2025-12-04T08:54:02.2688937Z * [new branch] 
gh/guangyey/169/head -> origin/gh/guangyey/169/head 2025-12-04T08:54:02.2689006Z * [new branch] gh/guangyey/169/orig -> origin/gh/guangyey/169/orig 2025-12-04T08:54:02.2689076Z * [new branch] gh/guangyey/170/base -> origin/gh/guangyey/170/base 2025-12-04T08:54:02.2689147Z * [new branch] gh/guangyey/170/head -> origin/gh/guangyey/170/head 2025-12-04T08:54:02.2689216Z * [new branch] gh/guangyey/170/orig -> origin/gh/guangyey/170/orig 2025-12-04T08:54:02.2689285Z * [new branch] gh/guangyey/171/base -> origin/gh/guangyey/171/base 2025-12-04T08:54:02.2689355Z * [new branch] gh/guangyey/171/head -> origin/gh/guangyey/171/head 2025-12-04T08:54:02.2689423Z * [new branch] gh/guangyey/171/orig -> origin/gh/guangyey/171/orig 2025-12-04T08:54:02.2689494Z * [new branch] gh/guangyey/178/base -> origin/gh/guangyey/178/base 2025-12-04T08:54:02.2689564Z * [new branch] gh/guangyey/178/head -> origin/gh/guangyey/178/head 2025-12-04T08:54:02.2689633Z * [new branch] gh/guangyey/178/orig -> origin/gh/guangyey/178/orig 2025-12-04T08:54:02.2689703Z * [new branch] gh/guangyey/182/base -> origin/gh/guangyey/182/base 2025-12-04T08:54:02.2689773Z * [new branch] gh/guangyey/182/head -> origin/gh/guangyey/182/head 2025-12-04T08:54:02.2689843Z * [new branch] gh/guangyey/182/orig -> origin/gh/guangyey/182/orig 2025-12-04T08:54:02.2689912Z * [new branch] gh/guangyey/183/base -> origin/gh/guangyey/183/base 2025-12-04T08:54:02.2689982Z * [new branch] gh/guangyey/183/head -> origin/gh/guangyey/183/head 2025-12-04T08:54:02.2690051Z * [new branch] gh/guangyey/183/orig -> origin/gh/guangyey/183/orig 2025-12-04T08:54:02.2690187Z * [new branch] gh/guangyey/185/base -> origin/gh/guangyey/185/base 2025-12-04T08:54:02.2690260Z * [new branch] gh/guangyey/185/head -> origin/gh/guangyey/185/head 2025-12-04T08:54:02.2690329Z * [new branch] gh/guangyey/185/orig -> origin/gh/guangyey/185/orig 2025-12-04T08:54:02.2690409Z * [new branch] gh/guangyey/186/base -> origin/gh/guangyey/186/base 
2025-12-04T08:54:02.2690478Z * [new branch] gh/guangyey/186/head -> origin/gh/guangyey/186/head 2025-12-04T08:54:02.2690547Z * [new branch] gh/guangyey/186/orig -> origin/gh/guangyey/186/orig 2025-12-04T08:54:02.2690619Z * [new branch] gh/guangyey/187/base -> origin/gh/guangyey/187/base 2025-12-04T08:54:02.2690690Z * [new branch] gh/guangyey/187/head -> origin/gh/guangyey/187/head 2025-12-04T08:54:02.2690759Z * [new branch] gh/guangyey/187/orig -> origin/gh/guangyey/187/orig 2025-12-04T08:54:02.2690833Z * [new branch] gh/guangyey/188/base -> origin/gh/guangyey/188/base 2025-12-04T08:54:02.2690903Z * [new branch] gh/guangyey/188/head -> origin/gh/guangyey/188/head 2025-12-04T08:54:02.2690973Z * [new branch] gh/guangyey/188/orig -> origin/gh/guangyey/188/orig 2025-12-04T08:54:02.2691074Z * [new branch] gh/guangyey/190/base -> origin/gh/guangyey/190/base 2025-12-04T08:54:02.2691144Z * [new branch] gh/guangyey/190/head -> origin/gh/guangyey/190/head 2025-12-04T08:54:02.2691213Z * [new branch] gh/guangyey/190/orig -> origin/gh/guangyey/190/orig 2025-12-04T08:54:02.2691284Z * [new branch] gh/guangyey/208/base -> origin/gh/guangyey/208/base 2025-12-04T08:54:02.2691352Z * [new branch] gh/guangyey/208/head -> origin/gh/guangyey/208/head 2025-12-04T08:54:02.2691422Z * [new branch] gh/guangyey/208/orig -> origin/gh/guangyey/208/orig 2025-12-04T08:54:02.2691494Z * [new branch] gh/guangyey/228/base -> origin/gh/guangyey/228/base 2025-12-04T08:54:02.2691564Z * [new branch] gh/guangyey/228/head -> origin/gh/guangyey/228/head 2025-12-04T08:54:02.2691632Z * [new branch] gh/guangyey/228/orig -> origin/gh/guangyey/228/orig 2025-12-04T08:54:02.2691707Z * [new branch] gh/guangyey/230/base -> origin/gh/guangyey/230/base 2025-12-04T08:54:02.2691776Z * [new branch] gh/guangyey/230/head -> origin/gh/guangyey/230/head 2025-12-04T08:54:02.2691848Z * [new branch] gh/guangyey/230/orig -> origin/gh/guangyey/230/orig 2025-12-04T08:54:02.2691920Z * [new branch] gh/guangyey/231/base -> 
origin/gh/guangyey/231/base 2025-12-04T08:54:02.2691989Z * [new branch] gh/guangyey/231/head -> origin/gh/guangyey/231/head 2025-12-04T08:54:02.2692063Z * [new branch] gh/guangyey/231/orig -> origin/gh/guangyey/231/orig 2025-12-04T08:54:02.2692135Z * [new branch] gh/guangyey/232/base -> origin/gh/guangyey/232/base 2025-12-04T08:54:02.2692205Z * [new branch] gh/guangyey/232/head -> origin/gh/guangyey/232/head 2025-12-04T08:54:02.2692277Z * [new branch] gh/guangyey/232/orig -> origin/gh/guangyey/232/orig 2025-12-04T08:54:02.2692349Z * [new branch] gh/guangyey/233/base -> origin/gh/guangyey/233/base 2025-12-04T08:54:02.2692419Z * [new branch] gh/guangyey/233/head -> origin/gh/guangyey/233/head 2025-12-04T08:54:02.2692490Z * [new branch] gh/guangyey/233/orig -> origin/gh/guangyey/233/orig 2025-12-04T08:54:02.2692560Z * [new branch] gh/guangyey/234/base -> origin/gh/guangyey/234/base 2025-12-04T08:54:02.2692630Z * [new branch] gh/guangyey/234/head -> origin/gh/guangyey/234/head 2025-12-04T08:54:02.2692748Z * [new branch] gh/guangyey/234/orig -> origin/gh/guangyey/234/orig 2025-12-04T08:54:02.2692819Z * [new branch] gh/guangyey/235/base -> origin/gh/guangyey/235/base 2025-12-04T08:54:02.2692888Z * [new branch] gh/guangyey/235/head -> origin/gh/guangyey/235/head 2025-12-04T08:54:02.2692959Z * [new branch] gh/guangyey/235/orig -> origin/gh/guangyey/235/orig 2025-12-04T08:54:02.2693030Z * [new branch] gh/guangyey/236/base -> origin/gh/guangyey/236/base 2025-12-04T08:54:02.2693100Z * [new branch] gh/guangyey/236/head -> origin/gh/guangyey/236/head 2025-12-04T08:54:02.2693172Z * [new branch] gh/guangyey/236/orig -> origin/gh/guangyey/236/orig 2025-12-04T08:54:02.2693242Z * [new branch] gh/guangyey/237/base -> origin/gh/guangyey/237/base 2025-12-04T08:54:02.2693312Z * [new branch] gh/guangyey/237/head -> origin/gh/guangyey/237/head 2025-12-04T08:54:02.2693385Z * [new branch] gh/guangyey/237/orig -> origin/gh/guangyey/237/orig 2025-12-04T08:54:02.2693454Z * [new branch] 
gh/guangyey/238/base -> origin/gh/guangyey/238/base 2025-12-04T08:54:02.2693527Z * [new branch] gh/guangyey/238/head -> origin/gh/guangyey/238/head 2025-12-04T08:54:02.2693598Z * [new branch] gh/guangyey/239/base -> origin/gh/guangyey/239/base 2025-12-04T08:54:02.2693714Z * [new branch] gh/guangyey/239/head -> origin/gh/guangyey/239/head 2025-12-04T08:54:02.2693786Z * [new branch] gh/guangyey/239/orig -> origin/gh/guangyey/239/orig 2025-12-04T08:54:02.2693856Z * [new branch] gh/guangyey/240/base -> origin/gh/guangyey/240/base 2025-12-04T08:54:02.2693926Z * [new branch] gh/guangyey/240/head -> origin/gh/guangyey/240/head 2025-12-04T08:54:02.2693997Z * [new branch] gh/guangyey/240/orig -> origin/gh/guangyey/240/orig 2025-12-04T08:54:02.2694070Z * [new branch] gh/guangyey/241/base -> origin/gh/guangyey/241/base 2025-12-04T08:54:02.2694139Z * [new branch] gh/guangyey/241/head -> origin/gh/guangyey/241/head 2025-12-04T08:54:02.2694211Z * [new branch] gh/guangyey/241/orig -> origin/gh/guangyey/241/orig 2025-12-04T08:54:02.2694280Z * [new branch] gh/guangyey/242/base -> origin/gh/guangyey/242/base 2025-12-04T08:54:02.2694351Z * [new branch] gh/guangyey/242/head -> origin/gh/guangyey/242/head 2025-12-04T08:54:02.2694422Z * [new branch] gh/guangyey/242/orig -> origin/gh/guangyey/242/orig 2025-12-04T08:54:02.2694490Z * [new branch] gh/guangyey/243/base -> origin/gh/guangyey/243/base 2025-12-04T08:54:02.2694558Z * [new branch] gh/guangyey/243/head -> origin/gh/guangyey/243/head 2025-12-04T08:54:02.2694630Z * [new branch] gh/guangyey/243/orig -> origin/gh/guangyey/243/orig 2025-12-04T08:54:02.2694700Z * [new branch] gh/guangyey/244/base -> origin/gh/guangyey/244/base 2025-12-04T08:54:02.2694768Z * [new branch] gh/guangyey/244/head -> origin/gh/guangyey/244/head 2025-12-04T08:54:02.2694838Z * [new branch] gh/guangyey/244/orig -> origin/gh/guangyey/244/orig 2025-12-04T08:54:02.2694907Z * [new branch] gh/guangyey/245/base -> origin/gh/guangyey/245/base 
2025-12-04T08:54:02.2694980Z * [new branch] gh/guangyey/245/head -> origin/gh/guangyey/245/head 2025-12-04T08:54:02.2695051Z * [new branch] gh/guangyey/245/orig -> origin/gh/guangyey/245/orig 2025-12-04T08:54:02.2695121Z * [new branch] gh/guangyey/246/base -> origin/gh/guangyey/246/base 2025-12-04T08:54:02.2695192Z * [new branch] gh/guangyey/246/head -> origin/gh/guangyey/246/head 2025-12-04T08:54:02.2695262Z * [new branch] gh/guangyey/246/orig -> origin/gh/guangyey/246/orig 2025-12-04T08:54:02.2695358Z * [new branch] gh/guangyey/247/base -> origin/gh/guangyey/247/base 2025-12-04T08:54:02.2695432Z * [new branch] gh/guangyey/247/head -> origin/gh/guangyey/247/head 2025-12-04T08:54:02.2695503Z * [new branch] gh/guangyey/247/orig -> origin/gh/guangyey/247/orig 2025-12-04T08:54:02.2695574Z * [new branch] gh/guangyey/248/base -> origin/gh/guangyey/248/base 2025-12-04T08:54:02.2695646Z * [new branch] gh/guangyey/248/head -> origin/gh/guangyey/248/head 2025-12-04T08:54:02.2695716Z * [new branch] gh/guangyey/248/orig -> origin/gh/guangyey/248/orig 2025-12-04T08:54:02.2695786Z * [new branch] gh/guangyey/249/base -> origin/gh/guangyey/249/base 2025-12-04T08:54:02.2695856Z * [new branch] gh/guangyey/249/head -> origin/gh/guangyey/249/head 2025-12-04T08:54:02.2695984Z * [new branch] gh/guangyey/249/orig -> origin/gh/guangyey/249/orig 2025-12-04T08:54:02.2696056Z * [new branch] gh/guangyey/250/base -> origin/gh/guangyey/250/base 2025-12-04T08:54:02.2696128Z * [new branch] gh/guangyey/250/head -> origin/gh/guangyey/250/head 2025-12-04T08:54:02.2696198Z * [new branch] gh/guangyey/250/orig -> origin/gh/guangyey/250/orig 2025-12-04T08:54:02.2696308Z * [new branch] gh/guangyey/251/base -> origin/gh/guangyey/251/base 2025-12-04T08:54:02.2696382Z * [new branch] gh/guangyey/251/head -> origin/gh/guangyey/251/head 2025-12-04T08:54:02.2696451Z * [new branch] gh/guangyey/251/orig -> origin/gh/guangyey/251/orig 2025-12-04T08:54:02.2696521Z * [new branch] gh/guangyey/252/base -> 
origin/gh/guangyey/252/base 2025-12-04T08:54:02.2696592Z * [new branch] gh/guangyey/252/head -> origin/gh/guangyey/252/head 2025-12-04T08:54:02.2696661Z * [new branch] gh/guangyey/252/orig -> origin/gh/guangyey/252/orig 2025-12-04T08:54:02.2696734Z * [new branch] gh/guangyey/253/base -> origin/gh/guangyey/253/base 2025-12-04T08:54:02.2696804Z * [new branch] gh/guangyey/253/head -> origin/gh/guangyey/253/head 2025-12-04T08:54:02.2696872Z * [new branch] gh/guangyey/253/orig -> origin/gh/guangyey/253/orig 2025-12-04T08:54:02.2696945Z * [new branch] gh/guangyey/254/base -> origin/gh/guangyey/254/base 2025-12-04T08:54:02.2697014Z * [new branch] gh/guangyey/254/head -> origin/gh/guangyey/254/head 2025-12-04T08:54:02.2697083Z * [new branch] gh/guangyey/254/orig -> origin/gh/guangyey/254/orig 2025-12-04T08:54:02.2697155Z * [new branch] gh/guangyey/255/base -> origin/gh/guangyey/255/base 2025-12-04T08:54:02.2697223Z * [new branch] gh/guangyey/255/head -> origin/gh/guangyey/255/head 2025-12-04T08:54:02.2697292Z * [new branch] gh/guangyey/255/orig -> origin/gh/guangyey/255/orig 2025-12-04T08:54:02.2697398Z * [new branch] gh/guilhermeleobas/107/base -> origin/gh/guilhermeleobas/107/base 2025-12-04T08:54:02.2697491Z * [new branch] gh/guilhermeleobas/107/head -> origin/gh/guilhermeleobas/107/head 2025-12-04T08:54:02.2697580Z * [new branch] gh/guilhermeleobas/107/orig -> origin/gh/guilhermeleobas/107/orig 2025-12-04T08:54:02.2697672Z * [new branch] gh/guilhermeleobas/108/base -> origin/gh/guilhermeleobas/108/base 2025-12-04T08:54:02.2697761Z * [new branch] gh/guilhermeleobas/108/head -> origin/gh/guilhermeleobas/108/head 2025-12-04T08:54:02.2697848Z * [new branch] gh/guilhermeleobas/108/orig -> origin/gh/guilhermeleobas/108/orig 2025-12-04T08:54:02.2697937Z * [new branch] gh/guilhermeleobas/150/base -> origin/gh/guilhermeleobas/150/base 2025-12-04T08:54:02.2698023Z * [new branch] gh/guilhermeleobas/150/head -> origin/gh/guilhermeleobas/150/head 2025-12-04T08:54:02.2698152Z * [new 
branch] gh/guilhermeleobas/150/orig -> origin/gh/guilhermeleobas/150/orig 2025-12-04T08:54:02.2698239Z * [new branch] gh/guilhermeleobas/168/base -> origin/gh/guilhermeleobas/168/base 2025-12-04T08:54:02.2698326Z * [new branch] gh/guilhermeleobas/168/head -> origin/gh/guilhermeleobas/168/head 2025-12-04T08:54:02.2698415Z * [new branch] gh/guilhermeleobas/168/orig -> origin/gh/guilhermeleobas/168/orig 2025-12-04T08:54:02.2698502Z * [new branch] gh/guilhermeleobas/169/base -> origin/gh/guilhermeleobas/169/base 2025-12-04T08:54:02.2698588Z * [new branch] gh/guilhermeleobas/169/head -> origin/gh/guilhermeleobas/169/head 2025-12-04T08:54:02.2698676Z * [new branch] gh/guilhermeleobas/169/orig -> origin/gh/guilhermeleobas/169/orig 2025-12-04T08:54:02.2698761Z * [new branch] gh/guilhermeleobas/170/base -> origin/gh/guilhermeleobas/170/base 2025-12-04T08:54:02.2698850Z * [new branch] gh/guilhermeleobas/170/head -> origin/gh/guilhermeleobas/170/head 2025-12-04T08:54:02.2698938Z * [new branch] gh/guilhermeleobas/170/orig -> origin/gh/guilhermeleobas/170/orig 2025-12-04T08:54:02.2699025Z * [new branch] gh/guilhermeleobas/171/base -> origin/gh/guilhermeleobas/171/base 2025-12-04T08:54:02.2699138Z * [new branch] gh/guilhermeleobas/171/head -> origin/gh/guilhermeleobas/171/head 2025-12-04T08:54:02.2699228Z * [new branch] gh/guilhermeleobas/171/orig -> origin/gh/guilhermeleobas/171/orig 2025-12-04T08:54:02.2699315Z * [new branch] gh/guilhermeleobas/173/base -> origin/gh/guilhermeleobas/173/base 2025-12-04T08:54:02.2699403Z * [new branch] gh/guilhermeleobas/173/head -> origin/gh/guilhermeleobas/173/head 2025-12-04T08:54:02.2699491Z * [new branch] gh/guilhermeleobas/173/orig -> origin/gh/guilhermeleobas/173/orig 2025-12-04T08:54:02.2699579Z * [new branch] gh/guilhermeleobas/193/base -> origin/gh/guilhermeleobas/193/base 2025-12-04T08:54:02.2699668Z * [new branch] gh/guilhermeleobas/193/head -> origin/gh/guilhermeleobas/193/head 2025-12-04T08:54:02.2699754Z * [new branch] 
gh/guilhermeleobas/193/orig -> origin/gh/guilhermeleobas/193/orig 2025-12-04T08:54:02.2699843Z * [new branch] gh/guilhermeleobas/204/base -> origin/gh/guilhermeleobas/204/base 2025-12-04T08:54:02.2699931Z * [new branch] gh/guilhermeleobas/204/head -> origin/gh/guilhermeleobas/204/head 2025-12-04T08:54:02.2700017Z * [new branch] gh/guilhermeleobas/204/orig -> origin/gh/guilhermeleobas/204/orig 2025-12-04T08:54:02.2700103Z * [new branch] gh/guilhermeleobas/211/base -> origin/gh/guilhermeleobas/211/base 2025-12-04T08:54:02.2700192Z * [new branch] gh/guilhermeleobas/211/head -> origin/gh/guilhermeleobas/211/head 2025-12-04T08:54:02.2700279Z * [new branch] gh/guilhermeleobas/211/orig -> origin/gh/guilhermeleobas/211/orig 2025-12-04T08:54:02.2700365Z * [new branch] gh/guilhermeleobas/226/base -> origin/gh/guilhermeleobas/226/base 2025-12-04T08:54:02.2700451Z * [new branch] gh/guilhermeleobas/226/head -> origin/gh/guilhermeleobas/226/head 2025-12-04T08:54:02.2700539Z * [new branch] gh/guilhermeleobas/226/orig -> origin/gh/guilhermeleobas/226/orig 2025-12-04T08:54:02.2700626Z * [new branch] gh/guilhermeleobas/236/base -> origin/gh/guilhermeleobas/236/base 2025-12-04T08:54:02.2700712Z * [new branch] gh/guilhermeleobas/236/head -> origin/gh/guilhermeleobas/236/head 2025-12-04T08:54:02.2700798Z * [new branch] gh/guilhermeleobas/236/orig -> origin/gh/guilhermeleobas/236/orig 2025-12-04T08:54:02.2700884Z * [new branch] gh/guilhermeleobas/247/base -> origin/gh/guilhermeleobas/247/base 2025-12-04T08:54:02.2700999Z * [new branch] gh/guilhermeleobas/247/head -> origin/gh/guilhermeleobas/247/head 2025-12-04T08:54:02.2701086Z * [new branch] gh/guilhermeleobas/247/orig -> origin/gh/guilhermeleobas/247/orig 2025-12-04T08:54:02.2701174Z * [new branch] gh/guilhermeleobas/248/base -> origin/gh/guilhermeleobas/248/base 2025-12-04T08:54:02.2701261Z * [new branch] gh/guilhermeleobas/248/head -> origin/gh/guilhermeleobas/248/head 2025-12-04T08:54:02.2701348Z * [new branch] 
gh/guilhermeleobas/248/orig -> origin/gh/guilhermeleobas/248/orig 2025-12-04T08:54:02.2701436Z * [new branch] gh/guilhermeleobas/250/base -> origin/gh/guilhermeleobas/250/base 2025-12-04T08:54:02.2701522Z * [new branch] gh/guilhermeleobas/250/head -> origin/gh/guilhermeleobas/250/head 2025-12-04T08:54:02.2701608Z * [new branch] gh/guilhermeleobas/250/orig -> origin/gh/guilhermeleobas/250/orig 2025-12-04T08:54:02.2701696Z * [new branch] gh/guilhermeleobas/253/base -> origin/gh/guilhermeleobas/253/base 2025-12-04T08:54:02.2701784Z * [new branch] gh/guilhermeleobas/253/head -> origin/gh/guilhermeleobas/253/head 2025-12-04T08:54:02.2701870Z * [new branch] gh/guilhermeleobas/253/orig -> origin/gh/guilhermeleobas/253/orig 2025-12-04T08:54:02.2701958Z * [new branch] gh/guilhermeleobas/254/base -> origin/gh/guilhermeleobas/254/base 2025-12-04T08:54:02.2702073Z * [new branch] gh/guilhermeleobas/254/head -> origin/gh/guilhermeleobas/254/head 2025-12-04T08:54:02.2702158Z * [new branch] gh/guilhermeleobas/254/orig -> origin/gh/guilhermeleobas/254/orig 2025-12-04T08:54:02.2702246Z * [new branch] gh/guilhermeleobas/255/base -> origin/gh/guilhermeleobas/255/base 2025-12-04T08:54:02.2702334Z * [new branch] gh/guilhermeleobas/255/head -> origin/gh/guilhermeleobas/255/head 2025-12-04T08:54:02.2702424Z * [new branch] gh/guilhermeleobas/255/orig -> origin/gh/guilhermeleobas/255/orig 2025-12-04T08:54:02.2702514Z * [new branch] gh/guilhermeleobas/256/base -> origin/gh/guilhermeleobas/256/base 2025-12-04T08:54:02.2702600Z * [new branch] gh/guilhermeleobas/256/head -> origin/gh/guilhermeleobas/256/head 2025-12-04T08:54:02.2702688Z * [new branch] gh/guilhermeleobas/256/orig -> origin/gh/guilhermeleobas/256/orig 2025-12-04T08:54:02.2702777Z * [new branch] gh/guilhermeleobas/257/base -> origin/gh/guilhermeleobas/257/base 2025-12-04T08:54:02.2702863Z * [new branch] gh/guilhermeleobas/257/head -> origin/gh/guilhermeleobas/257/head 2025-12-04T08:54:02.2702950Z * [new branch] 
gh/guilhermeleobas/257/orig -> origin/gh/guilhermeleobas/257/orig 2025-12-04T08:54:02.2703036Z * [new branch] gh/guilhermeleobas/258/base -> origin/gh/guilhermeleobas/258/base 2025-12-04T08:54:02.2703122Z * [new branch] gh/guilhermeleobas/258/head -> origin/gh/guilhermeleobas/258/head 2025-12-04T08:54:02.2703211Z * [new branch] gh/guilhermeleobas/258/orig -> origin/gh/guilhermeleobas/258/orig 2025-12-04T08:54:02.2703298Z * [new branch] gh/guilhermeleobas/259/base -> origin/gh/guilhermeleobas/259/base 2025-12-04T08:54:02.2703384Z * [new branch] gh/guilhermeleobas/259/head -> origin/gh/guilhermeleobas/259/head 2025-12-04T08:54:02.2703473Z * [new branch] gh/guilhermeleobas/259/orig -> origin/gh/guilhermeleobas/259/orig 2025-12-04T08:54:02.2703561Z * [new branch] gh/guilhermeleobas/260/base -> origin/gh/guilhermeleobas/260/base 2025-12-04T08:54:02.2703646Z * [new branch] gh/guilhermeleobas/260/head -> origin/gh/guilhermeleobas/260/head 2025-12-04T08:54:02.2703734Z * [new branch] gh/guilhermeleobas/260/orig -> origin/gh/guilhermeleobas/260/orig 2025-12-04T08:54:02.2703821Z * [new branch] gh/guilhermeleobas/261/base -> origin/gh/guilhermeleobas/261/base 2025-12-04T08:54:02.2703949Z * [new branch] gh/guilhermeleobas/261/head -> origin/gh/guilhermeleobas/261/head 2025-12-04T08:54:02.2704039Z * [new branch] gh/guilhermeleobas/261/orig -> origin/gh/guilhermeleobas/261/orig 2025-12-04T08:54:02.2704125Z * [new branch] gh/guilhermeleobas/262/base -> origin/gh/guilhermeleobas/262/base 2025-12-04T08:54:02.2704215Z * [new branch] gh/guilhermeleobas/262/head -> origin/gh/guilhermeleobas/262/head 2025-12-04T08:54:02.2704301Z * [new branch] gh/guilhermeleobas/262/orig -> origin/gh/guilhermeleobas/262/orig 2025-12-04T08:54:02.2704388Z * [new branch] gh/guilhermeleobas/263/base -> origin/gh/guilhermeleobas/263/base 2025-12-04T08:54:02.2704476Z * [new branch] gh/guilhermeleobas/263/head -> origin/gh/guilhermeleobas/263/head 2025-12-04T08:54:02.2704563Z * [new branch] 
gh/guilhermeleobas/263/orig -> origin/gh/guilhermeleobas/263/orig 2025-12-04T08:54:02.2704649Z * [new branch] gh/guilhermeleobas/264/base -> origin/gh/guilhermeleobas/264/base 2025-12-04T08:54:02.2704740Z * [new branch] gh/guilhermeleobas/264/head -> origin/gh/guilhermeleobas/264/head 2025-12-04T08:54:02.2704827Z * [new branch] gh/guilhermeleobas/264/orig -> origin/gh/guilhermeleobas/264/orig 2025-12-04T08:54:02.2704937Z * [new branch] gh/guilhermeleobas/265/base -> origin/gh/guilhermeleobas/265/base 2025-12-04T08:54:02.2705027Z * [new branch] gh/guilhermeleobas/265/head -> origin/gh/guilhermeleobas/265/head 2025-12-04T08:54:02.2705113Z * [new branch] gh/guilhermeleobas/265/orig -> origin/gh/guilhermeleobas/265/orig 2025-12-04T08:54:02.2705200Z * [new branch] gh/guilhermeleobas/266/base -> origin/gh/guilhermeleobas/266/base 2025-12-04T08:54:02.2705287Z * [new branch] gh/guilhermeleobas/266/head -> origin/gh/guilhermeleobas/266/head 2025-12-04T08:54:02.2705374Z * [new branch] gh/guilhermeleobas/266/orig -> origin/gh/guilhermeleobas/266/orig 2025-12-04T08:54:02.2705462Z * [new branch] gh/guilhermeleobas/267/base -> origin/gh/guilhermeleobas/267/base 2025-12-04T08:54:02.2705550Z * [new branch] gh/guilhermeleobas/267/head -> origin/gh/guilhermeleobas/267/head 2025-12-04T08:54:02.2705639Z * [new branch] gh/guilhermeleobas/267/orig -> origin/gh/guilhermeleobas/267/orig 2025-12-04T08:54:02.2705724Z * [new branch] gh/hameerabbasi/1/base -> origin/gh/hameerabbasi/1/base 2025-12-04T08:54:02.2705801Z * [new branch] gh/hameerabbasi/1/head -> origin/gh/hameerabbasi/1/head 2025-12-04T08:54:02.2705877Z * [new branch] gh/hameerabbasi/2/base -> origin/gh/hameerabbasi/2/base 2025-12-04T08:54:02.2706002Z * [new branch] gh/hameerabbasi/2/head -> origin/gh/hameerabbasi/2/head 2025-12-04T08:54:02.2706078Z * [new branch] gh/hameerabbasi/2/orig -> origin/gh/hameerabbasi/2/orig 2025-12-04T08:54:02.2706154Z * [new branch] gh/hameerabbasi/3/base -> origin/gh/hameerabbasi/3/base 
2025-12-04T08:54:02.2706229Z * [new branch] gh/hameerabbasi/3/head -> origin/gh/hameerabbasi/3/head 2025-12-04T08:54:02.2706303Z * [new branch] gh/hameerabbasi/3/orig -> origin/gh/hameerabbasi/3/orig 2025-12-04T08:54:02.2706378Z * [new branch] gh/hameerabbasi/4/base -> origin/gh/hameerabbasi/4/base 2025-12-04T08:54:02.2706454Z * [new branch] gh/hameerabbasi/4/head -> origin/gh/hameerabbasi/4/head 2025-12-04T08:54:02.2706528Z * [new branch] gh/hameerabbasi/4/orig -> origin/gh/hameerabbasi/4/orig 2025-12-04T08:54:02.2706598Z * [new branch] gh/huydhn/1/next -> origin/gh/huydhn/1/next 2025-12-04T08:54:02.2706669Z * [new branch] gh/huydhn/2/next -> origin/gh/huydhn/2/next 2025-12-04T08:54:02.2706735Z * [new branch] gh/huydhn/3/next -> origin/gh/huydhn/3/next 2025-12-04T08:54:02.2706847Z * [new branch] gh/huydhn/4/next -> origin/gh/huydhn/4/next 2025-12-04T08:54:02.2706917Z * [new branch] gh/huydhn/5/next -> origin/gh/huydhn/5/next 2025-12-04T08:54:02.2706981Z * [new branch] gh/huydhn/6/next -> origin/gh/huydhn/6/next 2025-12-04T08:54:02.2707049Z * [new branch] gh/int3/97/base -> origin/gh/int3/97/base 2025-12-04T08:54:02.2707116Z * [new branch] gh/int3/97/head -> origin/gh/int3/97/head 2025-12-04T08:54:02.2707186Z * [new branch] gh/isuruf/101/base -> origin/gh/isuruf/101/base 2025-12-04T08:54:02.2707255Z * [new branch] gh/isuruf/101/head -> origin/gh/isuruf/101/head 2025-12-04T08:54:02.2707327Z * [new branch] gh/isuruf/146/base -> origin/gh/isuruf/146/base 2025-12-04T08:54:02.2707395Z * [new branch] gh/isuruf/146/head -> origin/gh/isuruf/146/head 2025-12-04T08:54:02.2707466Z * [new branch] gh/isuruf/146/orig -> origin/gh/isuruf/146/orig 2025-12-04T08:54:02.2707532Z * [new branch] gh/isuruf/158/base -> origin/gh/isuruf/158/base 2025-12-04T08:54:02.2707600Z * [new branch] gh/isuruf/158/head -> origin/gh/isuruf/158/head 2025-12-04T08:54:02.2707701Z * [new branch] gh/isuruf/159/base -> origin/gh/isuruf/159/base 2025-12-04T08:54:02.2707768Z * [new branch] gh/isuruf/159/head 
-> origin/gh/isuruf/159/head 2025-12-04T08:54:02.2707835Z * [new branch] gh/isuruf/160/base -> origin/gh/isuruf/160/base 2025-12-04T08:54:02.2707900Z * [new branch] gh/isuruf/160/head -> origin/gh/isuruf/160/head 2025-12-04T08:54:02.2707966Z * [new branch] gh/isuruf/160/orig -> origin/gh/isuruf/160/orig 2025-12-04T08:54:02.2708035Z * [new branch] gh/isuruf/81/base -> origin/gh/isuruf/81/base 2025-12-04T08:54:02.2708104Z * [new branch] gh/isuruf/81/head -> origin/gh/isuruf/81/head 2025-12-04T08:54:02.2708171Z * [new branch] gh/isuruf/81/orig -> origin/gh/isuruf/81/orig 2025-12-04T08:54:02.2708245Z * [new branch] gh/jamesjwu/176/base -> origin/gh/jamesjwu/176/base 2025-12-04T08:54:02.2708318Z * [new branch] gh/jamesjwu/176/head -> origin/gh/jamesjwu/176/head 2025-12-04T08:54:02.2708388Z * [new branch] gh/jamesjwu/176/orig -> origin/gh/jamesjwu/176/orig 2025-12-04T08:54:02.2708458Z * [new branch] gh/jamesjwu/187/base -> origin/gh/jamesjwu/187/base 2025-12-04T08:54:02.2708528Z * [new branch] gh/jamesjwu/187/head -> origin/gh/jamesjwu/187/head 2025-12-04T08:54:02.2708597Z * [new branch] gh/jamesjwu/187/orig -> origin/gh/jamesjwu/187/orig 2025-12-04T08:54:02.2708666Z * [new branch] gh/jamesjwu/196/base -> origin/gh/jamesjwu/196/base 2025-12-04T08:54:02.2708736Z * [new branch] gh/jamesjwu/196/head -> origin/gh/jamesjwu/196/head 2025-12-04T08:54:02.2708805Z * [new branch] gh/jamesjwu/196/orig -> origin/gh/jamesjwu/196/orig 2025-12-04T08:54:02.2708875Z * [new branch] gh/jamesjwu/198/base -> origin/gh/jamesjwu/198/base 2025-12-04T08:54:02.2708944Z * [new branch] gh/jamesjwu/198/head -> origin/gh/jamesjwu/198/head 2025-12-04T08:54:02.2709014Z * [new branch] gh/jamesjwu/198/orig -> origin/gh/jamesjwu/198/orig 2025-12-04T08:54:02.2709083Z * [new branch] gh/jamesjwu/207/base -> origin/gh/jamesjwu/207/base 2025-12-04T08:54:02.2709151Z * [new branch] gh/jamesjwu/207/head -> origin/gh/jamesjwu/207/head 2025-12-04T08:54:02.2709221Z * [new branch] gh/jamesjwu/207/orig -> 
origin/gh/jamesjwu/207/orig 2025-12-04T08:54:02.2709288Z * [new branch] gh/jamesjwu/208/base -> origin/gh/jamesjwu/208/base 2025-12-04T08:54:02.2709384Z * [new branch] gh/jamesjwu/208/head -> origin/gh/jamesjwu/208/head 2025-12-04T08:54:02.2709453Z * [new branch] gh/jamesjwu/208/orig -> origin/gh/jamesjwu/208/orig 2025-12-04T08:54:02.2709523Z * [new branch] gh/jamesjwu/52/base -> origin/gh/jamesjwu/52/base 2025-12-04T08:54:02.2709594Z * [new branch] gh/jamesjwu/52/head -> origin/gh/jamesjwu/52/head 2025-12-04T08:54:02.2709664Z * [new branch] gh/jamesjwu/53/base -> origin/gh/jamesjwu/53/base 2025-12-04T08:54:02.2709732Z * [new branch] gh/jamesjwu/53/head -> origin/gh/jamesjwu/53/head 2025-12-04T08:54:02.2709800Z * [new branch] gh/jamesjwu/54/base -> origin/gh/jamesjwu/54/base 2025-12-04T08:54:02.2709869Z * [new branch] gh/jamesjwu/54/head -> origin/gh/jamesjwu/54/head 2025-12-04T08:54:02.2709937Z * [new branch] gh/jamesjwu/55/base -> origin/gh/jamesjwu/55/base 2025-12-04T08:54:02.2710006Z * [new branch] gh/jamesjwu/55/head -> origin/gh/jamesjwu/55/head 2025-12-04T08:54:02.2710076Z * [new branch] gh/jamesjwu/56/base -> origin/gh/jamesjwu/56/base 2025-12-04T08:54:02.2710143Z * [new branch] gh/jamesjwu/56/head -> origin/gh/jamesjwu/56/head 2025-12-04T08:54:02.2710238Z * [new branch] gh/jamesjwu/57/base -> origin/gh/jamesjwu/57/base 2025-12-04T08:54:02.2710308Z * [new branch] gh/jamesjwu/57/head -> origin/gh/jamesjwu/57/head 2025-12-04T08:54:02.2710375Z * [new branch] gh/jamesjwu/58/base -> origin/gh/jamesjwu/58/base 2025-12-04T08:54:02.2710442Z * [new branch] gh/jamesjwu/58/head -> origin/gh/jamesjwu/58/head 2025-12-04T08:54:02.2710512Z * [new branch] gh/jamesjwu/59/base -> origin/gh/jamesjwu/59/base 2025-12-04T08:54:02.2710580Z * [new branch] gh/jamesjwu/59/head -> origin/gh/jamesjwu/59/head 2025-12-04T08:54:02.2710651Z * [new branch] gh/jamesjwu/60/base -> origin/gh/jamesjwu/60/base 2025-12-04T08:54:02.2710718Z * [new branch] gh/jamesjwu/60/head -> 
origin/gh/jamesjwu/60/head 2025-12-04T08:54:02.2710785Z * [new branch] gh/jamesjwu/61/base -> origin/gh/jamesjwu/61/base 2025-12-04T08:54:02.2710855Z * [new branch] gh/jamesjwu/61/head -> origin/gh/jamesjwu/61/head 2025-12-04T08:54:02.2710923Z * [new branch] gh/jamesjwu/62/base -> origin/gh/jamesjwu/62/base 2025-12-04T08:54:02.2710991Z * [new branch] gh/jamesjwu/62/head -> origin/gh/jamesjwu/62/head 2025-12-04T08:54:02.2711059Z * [new branch] gh/jamesjwu/63/base -> origin/gh/jamesjwu/63/base 2025-12-04T08:54:02.2711127Z * [new branch] gh/jamesjwu/63/head -> origin/gh/jamesjwu/63/head 2025-12-04T08:54:02.2711195Z * [new branch] gh/jamesjwu/64/base -> origin/gh/jamesjwu/64/base 2025-12-04T08:54:02.2711265Z * [new branch] gh/jamesjwu/64/head -> origin/gh/jamesjwu/64/head 2025-12-04T08:54:02.2711334Z * [new branch] gh/jamesjwu/65/base -> origin/gh/jamesjwu/65/base 2025-12-04T08:54:02.2711401Z * [new branch] gh/jamesjwu/65/head -> origin/gh/jamesjwu/65/head 2025-12-04T08:54:02.2711472Z * [new branch] gh/janeyx99/165/base -> origin/gh/janeyx99/165/base 2025-12-04T08:54:02.2711542Z * [new branch] gh/janeyx99/165/head -> origin/gh/janeyx99/165/head 2025-12-04T08:54:02.2711610Z * [new branch] gh/janeyx99/165/orig -> origin/gh/janeyx99/165/orig 2025-12-04T08:54:02.2711679Z * [new branch] gh/janeyx99/201/base -> origin/gh/janeyx99/201/base 2025-12-04T08:54:02.2711747Z * [new branch] gh/janeyx99/201/head -> origin/gh/janeyx99/201/head 2025-12-04T08:54:02.2711815Z * [new branch] gh/janeyx99/201/orig -> origin/gh/janeyx99/201/orig 2025-12-04T08:54:02.2711915Z * [new branch] gh/janeyx99/225/base -> origin/gh/janeyx99/225/base 2025-12-04T08:54:02.2711982Z * [new branch] gh/janeyx99/225/head -> origin/gh/janeyx99/225/head 2025-12-04T08:54:02.2712050Z * [new branch] gh/janeyx99/225/orig -> origin/gh/janeyx99/225/orig 2025-12-04T08:54:02.2712121Z * [new branch] gh/janeyx99/299/base -> origin/gh/janeyx99/299/base 2025-12-04T08:54:02.2712189Z * [new branch] gh/janeyx99/299/head -> 
origin/gh/janeyx99/299/head 2025-12-04T08:54:02.2712259Z * [new branch] gh/janeyx99/299/orig -> origin/gh/janeyx99/299/orig 2025-12-04T08:54:02.2712326Z * [new branch] gh/janeyx99/302/base -> origin/gh/janeyx99/302/base 2025-12-04T08:54:02.2712394Z * [new branch] gh/janeyx99/302/head -> origin/gh/janeyx99/302/head 2025-12-04T08:54:02.2712464Z * [new branch] gh/janeyx99/303/base -> origin/gh/janeyx99/303/base 2025-12-04T08:54:02.2712532Z * [new branch] gh/janeyx99/303/head -> origin/gh/janeyx99/303/head 2025-12-04T08:54:02.2712599Z * [new branch] gh/janeyx99/305/base -> origin/gh/janeyx99/305/base 2025-12-04T08:54:02.2712667Z * [new branch] gh/janeyx99/305/head -> origin/gh/janeyx99/305/head 2025-12-04T08:54:02.2712766Z * [new branch] gh/janeyx99/306/base -> origin/gh/janeyx99/306/base 2025-12-04T08:54:02.2712835Z * [new branch] gh/janeyx99/306/head -> origin/gh/janeyx99/306/head 2025-12-04T08:54:02.2712905Z * [new branch] gh/janeyx99/314/base -> origin/gh/janeyx99/314/base 2025-12-04T08:54:02.2712972Z * [new branch] gh/janeyx99/314/head -> origin/gh/janeyx99/314/head 2025-12-04T08:54:02.2713041Z * [new branch] gh/janeyx99/314/orig -> origin/gh/janeyx99/314/orig 2025-12-04T08:54:02.2713113Z * [new branch] gh/janeyx99/315/base -> origin/gh/janeyx99/315/base 2025-12-04T08:54:02.2713181Z * [new branch] gh/janeyx99/315/head -> origin/gh/janeyx99/315/head 2025-12-04T08:54:02.2713249Z * [new branch] gh/janeyx99/315/orig -> origin/gh/janeyx99/315/orig 2025-12-04T08:54:02.2713318Z * [new branch] gh/janeyx99/316/base -> origin/gh/janeyx99/316/base 2025-12-04T08:54:02.2713387Z * [new branch] gh/janeyx99/316/head -> origin/gh/janeyx99/316/head 2025-12-04T08:54:02.2713455Z * [new branch] gh/janeyx99/316/orig -> origin/gh/janeyx99/316/orig 2025-12-04T08:54:02.2713524Z * [new branch] gh/janeyx99/317/base -> origin/gh/janeyx99/317/base 2025-12-04T08:54:02.2713593Z * [new branch] gh/janeyx99/317/head -> origin/gh/janeyx99/317/head 2025-12-04T08:54:02.2713662Z * [new branch] 
gh/janeyx99/317/orig -> origin/gh/janeyx99/317/orig 2025-12-04T08:54:02.2713731Z * [new branch] gh/janeyx99/325/base -> origin/gh/janeyx99/325/base 2025-12-04T08:54:02.2713798Z * [new branch] gh/janeyx99/325/head -> origin/gh/janeyx99/325/head 2025-12-04T08:54:02.2713867Z * [new branch] gh/janeyx99/325/orig -> origin/gh/janeyx99/325/orig 2025-12-04T08:54:02.2713935Z * [new branch] gh/janeyx99/327/base -> origin/gh/janeyx99/327/base 2025-12-04T08:54:02.2714005Z * [new branch] gh/janeyx99/327/head -> origin/gh/janeyx99/327/head 2025-12-04T08:54:02.2714074Z * [new branch] gh/janeyx99/327/orig -> origin/gh/janeyx99/327/orig 2025-12-04T08:54:02.2714141Z * [new branch] gh/janeyx99/328/base -> origin/gh/janeyx99/328/base 2025-12-04T08:54:02.2714209Z * [new branch] gh/janeyx99/328/head -> origin/gh/janeyx99/328/head 2025-12-04T08:54:02.2714278Z * [new branch] gh/janeyx99/328/orig -> origin/gh/janeyx99/328/orig 2025-12-04T08:54:02.2714373Z * [new branch] gh/janeyx99/329/base -> origin/gh/janeyx99/329/base 2025-12-04T08:54:02.2714441Z * [new branch] gh/janeyx99/329/head -> origin/gh/janeyx99/329/head 2025-12-04T08:54:02.2714511Z * [new branch] gh/janeyx99/329/orig -> origin/gh/janeyx99/329/orig 2025-12-04T08:54:02.2714578Z * [new branch] gh/janeyx99/330/base -> origin/gh/janeyx99/330/base 2025-12-04T08:54:02.2714647Z * [new branch] gh/janeyx99/330/head -> origin/gh/janeyx99/330/head 2025-12-04T08:54:02.2714716Z * [new branch] gh/janeyx99/330/orig -> origin/gh/janeyx99/330/orig 2025-12-04T08:54:02.2714784Z * [new branch] gh/janeyx99/331/base -> origin/gh/janeyx99/331/base 2025-12-04T08:54:02.2714852Z * [new branch] gh/janeyx99/331/head -> origin/gh/janeyx99/331/head 2025-12-04T08:54:02.2714922Z * [new branch] gh/janeyx99/331/orig -> origin/gh/janeyx99/331/orig 2025-12-04T08:54:02.2714991Z * [new branch] gh/janeyx99/332/base -> origin/gh/janeyx99/332/base 2025-12-04T08:54:02.2715059Z * [new branch] gh/janeyx99/332/head -> origin/gh/janeyx99/332/head 
2025-12-04T08:54:02.2715128Z * [new branch] gh/janeyx99/332/orig -> origin/gh/janeyx99/332/orig 2025-12-04T08:54:02.2715223Z * [new branch] gh/janeyx99/333/base -> origin/gh/janeyx99/333/base 2025-12-04T08:54:02.2715294Z * [new branch] gh/janeyx99/333/head -> origin/gh/janeyx99/333/head 2025-12-04T08:54:02.2715362Z * [new branch] gh/janeyx99/333/orig -> origin/gh/janeyx99/333/orig 2025-12-04T08:54:02.2715430Z * [new branch] gh/janeyx99/88/base -> origin/gh/janeyx99/88/base 2025-12-04T08:54:02.2715500Z * [new branch] gh/janeyx99/88/head -> origin/gh/janeyx99/88/head 2025-12-04T08:54:02.2715567Z * [new branch] gh/janeyx99/88/orig -> origin/gh/janeyx99/88/orig 2025-12-04T08:54:02.2715637Z * [new branch] gh/jansel/360/base -> origin/gh/jansel/360/base 2025-12-04T08:54:02.2715705Z * [new branch] gh/jansel/360/head -> origin/gh/jansel/360/head 2025-12-04T08:54:02.2715771Z * [new branch] gh/jansel/451/base -> origin/gh/jansel/451/base 2025-12-04T08:54:02.2715840Z * [new branch] gh/jansel/451/head -> origin/gh/jansel/451/head 2025-12-04T08:54:02.2715908Z * [new branch] gh/jansel/451/orig -> origin/gh/jansel/451/orig 2025-12-04T08:54:02.2716023Z * [new branch] gh/jansel/462/base -> origin/gh/jansel/462/base 2025-12-04T08:54:02.2716089Z * [new branch] gh/jansel/462/head -> origin/gh/jansel/462/head 2025-12-04T08:54:02.2716156Z * [new branch] gh/jansel/462/orig -> origin/gh/jansel/462/orig 2025-12-04T08:54:02.2716221Z * [new branch] gh/jansel/533/base -> origin/gh/jansel/533/base 2025-12-04T08:54:02.2716288Z * [new branch] gh/jansel/533/head -> origin/gh/jansel/533/head 2025-12-04T08:54:02.2716355Z * [new branch] gh/jansel/533/orig -> origin/gh/jansel/533/orig 2025-12-04T08:54:02.2716420Z * [new branch] gh/jansel/552/base -> origin/gh/jansel/552/base 2025-12-04T08:54:02.2716486Z * [new branch] gh/jansel/552/head -> origin/gh/jansel/552/head 2025-12-04T08:54:02.2716555Z * [new branch] gh/jansel/552/orig -> origin/gh/jansel/552/orig 2025-12-04T08:54:02.2716621Z * [new branch] 
gh/jansel/553/base -> origin/gh/jansel/553/base 2025-12-04T08:54:02.2716687Z * [new branch] gh/jansel/553/head -> origin/gh/jansel/553/head 2025-12-04T08:54:02.2716754Z * [new branch] gh/jansel/553/orig -> origin/gh/jansel/553/orig 2025-12-04T08:54:02.2716820Z * [new branch] gh/jansel/554/base -> origin/gh/jansel/554/base 2025-12-04T08:54:02.2716937Z * [new branch] gh/jansel/554/head -> origin/gh/jansel/554/head 2025-12-04T08:54:02.2717003Z * [new branch] gh/jansel/554/orig -> origin/gh/jansel/554/orig 2025-12-04T08:54:02.2717068Z * [new branch] gh/jansel/555/base -> origin/gh/jansel/555/base 2025-12-04T08:54:02.2717136Z * [new branch] gh/jansel/555/head -> origin/gh/jansel/555/head 2025-12-04T08:54:02.2717203Z * [new branch] gh/jansel/555/orig -> origin/gh/jansel/555/orig 2025-12-04T08:54:02.2717270Z * [new branch] gh/jansel/556/base -> origin/gh/jansel/556/base 2025-12-04T08:54:02.2717337Z * [new branch] gh/jansel/556/head -> origin/gh/jansel/556/head 2025-12-04T08:54:02.2717403Z * [new branch] gh/jansel/556/orig -> origin/gh/jansel/556/orig 2025-12-04T08:54:02.2717468Z * [new branch] gh/jansel/557/base -> origin/gh/jansel/557/base 2025-12-04T08:54:02.2717537Z * [new branch] gh/jansel/557/head -> origin/gh/jansel/557/head 2025-12-04T08:54:02.2717602Z * [new branch] gh/jansel/557/orig -> origin/gh/jansel/557/orig 2025-12-04T08:54:02.2717668Z * [new branch] gh/jansel/558/base -> origin/gh/jansel/558/base 2025-12-04T08:54:02.2717734Z * [new branch] gh/jansel/558/head -> origin/gh/jansel/558/head 2025-12-04T08:54:02.2717837Z * [new branch] gh/jansel/558/orig -> origin/gh/jansel/558/orig 2025-12-04T08:54:02.2717904Z * [new branch] gh/jansel/559/base -> origin/gh/jansel/559/base 2025-12-04T08:54:02.2717971Z * [new branch] gh/jansel/559/head -> origin/gh/jansel/559/head 2025-12-04T08:54:02.2718038Z * [new branch] gh/jansel/559/orig -> origin/gh/jansel/559/orig 2025-12-04T08:54:02.2718104Z * [new branch] gh/jansel/560/base -> origin/gh/jansel/560/base 
2025-12-04T08:54:02.2718172Z * [new branch] gh/jansel/560/head -> origin/gh/jansel/560/head 2025-12-04T08:54:02.2718238Z * [new branch] gh/jansel/560/orig -> origin/gh/jansel/560/orig 2025-12-04T08:54:02.2718303Z * [new branch] gh/jansel/561/base -> origin/gh/jansel/561/base 2025-12-04T08:54:02.2718372Z * [new branch] gh/jansel/561/head -> origin/gh/jansel/561/head 2025-12-04T08:54:02.2718439Z * [new branch] gh/jansel/561/orig -> origin/gh/jansel/561/orig 2025-12-04T08:54:02.2718506Z * [new branch] gh/jansel/562/base -> origin/gh/jansel/562/base 2025-12-04T08:54:02.2718571Z * [new branch] gh/jansel/562/head -> origin/gh/jansel/562/head 2025-12-04T08:54:02.2718637Z * [new branch] gh/jansel/562/orig -> origin/gh/jansel/562/orig 2025-12-04T08:54:02.2718705Z * [new branch] gh/jansel/563/base -> origin/gh/jansel/563/base 2025-12-04T08:54:02.2718773Z * [new branch] gh/jansel/563/head -> origin/gh/jansel/563/head 2025-12-04T08:54:02.2718838Z * [new branch] gh/jansel/563/orig -> origin/gh/jansel/563/orig 2025-12-04T08:54:02.2718905Z * [new branch] gh/jansel/564/base -> origin/gh/jansel/564/base 2025-12-04T08:54:02.2718970Z * [new branch] gh/jansel/564/head -> origin/gh/jansel/564/head 2025-12-04T08:54:02.2719037Z * [new branch] gh/jansel/564/orig -> origin/gh/jansel/564/orig 2025-12-04T08:54:02.2719104Z * [new branch] gh/jansel/565/base -> origin/gh/jansel/565/base 2025-12-04T08:54:02.2719169Z * [new branch] gh/jansel/565/head -> origin/gh/jansel/565/head 2025-12-04T08:54:02.2719235Z * [new branch] gh/jansel/565/orig -> origin/gh/jansel/565/orig 2025-12-04T08:54:02.2719302Z * [new branch] gh/jansel/566/base -> origin/gh/jansel/566/base 2025-12-04T08:54:02.2719397Z * [new branch] gh/jansel/566/head -> origin/gh/jansel/566/head 2025-12-04T08:54:02.2719462Z * [new branch] gh/jansel/566/orig -> origin/gh/jansel/566/orig 2025-12-04T08:54:02.2719530Z * [new branch] gh/jansel/567/base -> origin/gh/jansel/567/base 2025-12-04T08:54:02.2719596Z * [new branch] gh/jansel/567/head -> 
origin/gh/jansel/567/head 2025-12-04T08:54:02.2719663Z * [new branch] gh/jansel/567/orig -> origin/gh/jansel/567/orig 2025-12-04T08:54:02.2719730Z * [new branch] gh/jansel/568/base -> origin/gh/jansel/568/base 2025-12-04T08:54:02.2719797Z * [new branch] gh/jansel/568/head -> origin/gh/jansel/568/head 2025-12-04T08:54:02.2719862Z * [new branch] gh/jansel/568/orig -> origin/gh/jansel/568/orig 2025-12-04T08:54:02.2719930Z * [new branch] gh/jansel/569/base -> origin/gh/jansel/569/base 2025-12-04T08:54:02.2719997Z * [new branch] gh/jansel/569/head -> origin/gh/jansel/569/head 2025-12-04T08:54:02.2720064Z * [new branch] gh/jansel/569/orig -> origin/gh/jansel/569/orig 2025-12-04T08:54:02.2720130Z * [new branch] gh/jansel/570/base -> origin/gh/jansel/570/base 2025-12-04T08:54:02.2720196Z * [new branch] gh/jansel/570/head -> origin/gh/jansel/570/head 2025-12-04T08:54:02.2720294Z * [new branch] gh/jansel/570/orig -> origin/gh/jansel/570/orig 2025-12-04T08:54:02.2720360Z * [new branch] gh/jansel/571/base -> origin/gh/jansel/571/base 2025-12-04T08:54:02.2720426Z * [new branch] gh/jansel/571/head -> origin/gh/jansel/571/head 2025-12-04T08:54:02.2720493Z * [new branch] gh/jansel/571/orig -> origin/gh/jansel/571/orig 2025-12-04T08:54:02.2720558Z * [new branch] gh/jansel/572/base -> origin/gh/jansel/572/base 2025-12-04T08:54:02.2720626Z * [new branch] gh/jansel/572/head -> origin/gh/jansel/572/head 2025-12-04T08:54:02.2720694Z * [new branch] gh/jansel/572/orig -> origin/gh/jansel/572/orig 2025-12-04T08:54:02.2720760Z * [new branch] gh/jansel/573/base -> origin/gh/jansel/573/base 2025-12-04T08:54:02.2720826Z * [new branch] gh/jansel/573/head -> origin/gh/jansel/573/head 2025-12-04T08:54:02.2720896Z * [new branch] gh/jansel/573/orig -> origin/gh/jansel/573/orig 2025-12-04T08:54:02.2720962Z * [new branch] gh/jansel/574/base -> origin/gh/jansel/574/base 2025-12-04T08:54:02.2721027Z * [new branch] gh/jansel/574/head -> origin/gh/jansel/574/head 2025-12-04T08:54:02.2721094Z * [new 
branch] gh/jansel/574/orig -> origin/gh/jansel/574/orig 2025-12-04T08:54:02.2721160Z * [new branch] gh/jansel/575/base -> origin/gh/jansel/575/base 2025-12-04T08:54:02.2721225Z * [new branch] gh/jansel/575/head -> origin/gh/jansel/575/head 2025-12-04T08:54:02.2721293Z * [new branch] gh/jansel/575/orig -> origin/gh/jansel/575/orig 2025-12-04T08:54:02.2721358Z * [new branch] gh/jansel/576/base -> origin/gh/jansel/576/base 2025-12-04T08:54:02.2721424Z * [new branch] gh/jansel/576/head -> origin/gh/jansel/576/head 2025-12-04T08:54:02.2721492Z * [new branch] gh/jansel/576/orig -> origin/gh/jansel/576/orig 2025-12-04T08:54:02.2721572Z * [new branch] gh/jbschlosser/247/base -> origin/gh/jbschlosser/247/base 2025-12-04T08:54:02.2721650Z * [new branch] gh/jbschlosser/247/head -> origin/gh/jbschlosser/247/head 2025-12-04T08:54:02.2721727Z * [new branch] gh/jbschlosser/247/orig -> origin/gh/jbschlosser/247/orig 2025-12-04T08:54:02.2721802Z * [new branch] gh/jbschlosser/250/base -> origin/gh/jbschlosser/250/base 2025-12-04T08:54:02.2721904Z * [new branch] gh/jbschlosser/250/head -> origin/gh/jbschlosser/250/head 2025-12-04T08:54:02.2721978Z * [new branch] gh/jbschlosser/250/orig -> origin/gh/jbschlosser/250/orig 2025-12-04T08:54:02.2722050Z * [new branch] gh/jerryzh168/1/base -> origin/gh/jerryzh168/1/base 2025-12-04T08:54:02.2722121Z * [new branch] gh/jerryzh168/1/head -> origin/gh/jerryzh168/1/head 2025-12-04T08:54:02.2722192Z * [new branch] gh/jerryzh168/1/orig -> origin/gh/jerryzh168/1/orig 2025-12-04T08:54:02.2722263Z * [new branch] gh/jiayisunx/59/base -> origin/gh/jiayisunx/59/base 2025-12-04T08:54:02.2722334Z * [new branch] gh/jiayisunx/59/head -> origin/gh/jiayisunx/59/head 2025-12-04T08:54:02.2722403Z * [new branch] gh/jiayisunx/59/orig -> origin/gh/jiayisunx/59/orig 2025-12-04T08:54:02.2722472Z * [new branch] gh/jiayisunx/61/base -> origin/gh/jiayisunx/61/base 2025-12-04T08:54:02.2722546Z * [new branch] gh/jiayisunx/61/head -> origin/gh/jiayisunx/61/head 
2025-12-04T08:54:02.2722615Z * [new branch] gh/jiayisunx/61/orig -> origin/gh/jiayisunx/61/orig 2025-12-04T08:54:02.2722684Z * [new branch] gh/jiayisunx/68/base -> origin/gh/jiayisunx/68/base 2025-12-04T08:54:02.2722753Z * [new branch] gh/jiayisunx/68/head -> origin/gh/jiayisunx/68/head 2025-12-04T08:54:02.2722858Z * [new branch] gh/jiayisunx/68/orig -> origin/gh/jiayisunx/68/orig 2025-12-04T08:54:02.2722929Z * [new branch] gh/jiayisunx/77/base -> origin/gh/jiayisunx/77/base 2025-12-04T08:54:02.2722999Z * [new branch] gh/jiayisunx/77/head -> origin/gh/jiayisunx/77/head 2025-12-04T08:54:02.2723068Z * [new branch] gh/jiayisunx/77/orig -> origin/gh/jiayisunx/77/orig 2025-12-04T08:54:02.2723138Z * [new branch] gh/jiayisunx/78/base -> origin/gh/jiayisunx/78/base 2025-12-04T08:54:02.2723212Z * [new branch] gh/jiayisunx/78/head -> origin/gh/jiayisunx/78/head 2025-12-04T08:54:02.2723281Z * [new branch] gh/jiayisunx/78/orig -> origin/gh/jiayisunx/78/orig 2025-12-04T08:54:02.2723352Z * [new branch] gh/jiayisunx/79/base -> origin/gh/jiayisunx/79/base 2025-12-04T08:54:02.2723421Z * [new branch] gh/jiayisunx/79/head -> origin/gh/jiayisunx/79/head 2025-12-04T08:54:02.2723492Z * [new branch] gh/jiayisunx/79/orig -> origin/gh/jiayisunx/79/orig 2025-12-04T08:54:02.2723563Z * [new branch] gh/jiayisunx/82/base -> origin/gh/jiayisunx/82/base 2025-12-04T08:54:02.2723632Z * [new branch] gh/jiayisunx/82/head -> origin/gh/jiayisunx/82/head 2025-12-04T08:54:02.2723702Z * [new branch] gh/jiayisunx/82/orig -> origin/gh/jiayisunx/82/orig 2025-12-04T08:54:02.2723772Z * [new branch] gh/jiayisunx/83/base -> origin/gh/jiayisunx/83/base 2025-12-04T08:54:02.2723843Z * [new branch] gh/jiayisunx/83/head -> origin/gh/jiayisunx/83/head 2025-12-04T08:54:02.2723912Z * [new branch] gh/jiayisunx/83/orig -> origin/gh/jiayisunx/83/orig 2025-12-04T08:54:02.2723984Z * [new branch] gh/jiayisunx/84/base -> origin/gh/jiayisunx/84/base 2025-12-04T08:54:02.2724055Z * [new branch] gh/jiayisunx/84/head -> 
origin/gh/jiayisunx/84/head 2025-12-04T08:54:02.2724123Z * [new branch] gh/jiayisunx/84/orig -> origin/gh/jiayisunx/84/orig 2025-12-04T08:54:02.2724194Z * [new branch] gh/jiayisunx/85/base -> origin/gh/jiayisunx/85/base 2025-12-04T08:54:02.2724263Z * [new branch] gh/jiayisunx/85/head -> origin/gh/jiayisunx/85/head 2025-12-04T08:54:02.2724333Z * [new branch] gh/jiayisunx/85/orig -> origin/gh/jiayisunx/85/orig 2025-12-04T08:54:02.2724403Z * [new branch] gh/jiayisunx/86/base -> origin/gh/jiayisunx/86/base 2025-12-04T08:54:02.2724499Z * [new branch] gh/jiayisunx/86/head -> origin/gh/jiayisunx/86/head 2025-12-04T08:54:02.2724569Z * [new branch] gh/jiayisunx/86/orig -> origin/gh/jiayisunx/86/orig 2025-12-04T08:54:02.2724640Z * [new branch] gh/jiayisunx/87/base -> origin/gh/jiayisunx/87/base 2025-12-04T08:54:02.2724711Z * [new branch] gh/jiayisunx/87/head -> origin/gh/jiayisunx/87/head 2025-12-04T08:54:02.2724781Z * [new branch] gh/jiayisunx/87/orig -> origin/gh/jiayisunx/87/orig 2025-12-04T08:54:02.2724851Z * [new branch] gh/jiayisunx/88/base -> origin/gh/jiayisunx/88/base 2025-12-04T08:54:02.2724920Z * [new branch] gh/jiayisunx/88/head -> origin/gh/jiayisunx/88/head 2025-12-04T08:54:02.2724991Z * [new branch] gh/jiayisunx/88/orig -> origin/gh/jiayisunx/88/orig 2025-12-04T08:54:02.2725060Z * [new branch] gh/jiayisunx/89/base -> origin/gh/jiayisunx/89/base 2025-12-04T08:54:02.2725130Z * [new branch] gh/jiayisunx/89/head -> origin/gh/jiayisunx/89/head 2025-12-04T08:54:02.2725201Z * [new branch] gh/jiayisunx/89/orig -> origin/gh/jiayisunx/89/orig 2025-12-04T08:54:02.2725270Z * [new branch] gh/jiayisunx/90/base -> origin/gh/jiayisunx/90/base 2025-12-04T08:54:02.2725377Z * [new branch] gh/jiayisunx/90/head -> origin/gh/jiayisunx/90/head 2025-12-04T08:54:02.2725449Z * [new branch] gh/jiayisunx/90/orig -> origin/gh/jiayisunx/90/orig 2025-12-04T08:54:02.2725524Z * [new branch] gh/jjwu@meta.com/1/base -> origin/gh/jjwu@meta.com/1/base 2025-12-04T08:54:02.2725597Z * [new branch] 
gh/jjwu@meta.com/1/head -> origin/gh/jjwu@meta.com/1/head 2025-12-04T08:54:02.2725667Z * [new branch] gh/jturney/1/base -> origin/gh/jturney/1/base 2025-12-04T08:54:02.2725736Z * [new branch] gh/jturney/1/head -> origin/gh/jturney/1/head 2025-12-04T08:54:02.2725804Z * [new branch] gh/jturney/1/orig -> origin/gh/jturney/1/orig 2025-12-04T08:54:02.2725872Z * [new branch] gh/jturney/2/base -> origin/gh/jturney/2/base 2025-12-04T08:54:02.2725986Z * [new branch] gh/jturney/2/head -> origin/gh/jturney/2/head 2025-12-04T08:54:02.2726053Z * [new branch] gh/jturney/2/orig -> origin/gh/jturney/2/orig 2025-12-04T08:54:02.2726131Z * [new branch] gh/karthickai/10/base -> origin/gh/karthickai/10/base 2025-12-04T08:54:02.2726206Z * [new branch] gh/karthickai/10/head -> origin/gh/karthickai/10/head 2025-12-04T08:54:02.2726279Z * [new branch] gh/karthickai/10/orig -> origin/gh/karthickai/10/orig 2025-12-04T08:54:02.2726355Z * [new branch] gh/karthickai/11/base -> origin/gh/karthickai/11/base 2025-12-04T08:54:02.2726426Z * [new branch] gh/karthickai/11/head -> origin/gh/karthickai/11/head 2025-12-04T08:54:02.2726500Z * [new branch] gh/karthickai/11/orig -> origin/gh/karthickai/11/orig 2025-12-04T08:54:02.2726573Z * [new branch] gh/karthickai/12/base -> origin/gh/karthickai/12/base 2025-12-04T08:54:02.2726644Z * [new branch] gh/karthickai/12/head -> origin/gh/karthickai/12/head 2025-12-04T08:54:02.2726718Z * [new branch] gh/karthickai/12/orig -> origin/gh/karthickai/12/orig 2025-12-04T08:54:02.2726789Z * [new branch] gh/karthickai/13/base -> origin/gh/karthickai/13/base 2025-12-04T08:54:02.2726860Z * [new branch] gh/karthickai/13/head -> origin/gh/karthickai/13/head 2025-12-04T08:54:02.2726931Z * [new branch] gh/karthickai/13/orig -> origin/gh/karthickai/13/orig 2025-12-04T08:54:02.2727003Z * [new branch] gh/karthickai/14/base -> origin/gh/karthickai/14/base 2025-12-04T08:54:02.2727074Z * [new branch] gh/karthickai/14/head -> origin/gh/karthickai/14/head 2025-12-04T08:54:02.2727192Z 
* [new branch] gh/karthickai/14/orig -> origin/gh/karthickai/14/orig 2025-12-04T08:54:02.2727264Z * [new branch] gh/karthickai/15/base -> origin/gh/karthickai/15/base 2025-12-04T08:54:02.2727334Z * [new branch] gh/karthickai/15/head -> origin/gh/karthickai/15/head 2025-12-04T08:54:02.2727408Z * [new branch] gh/karthickai/15/orig -> origin/gh/karthickai/15/orig 2025-12-04T08:54:02.2727480Z * [new branch] gh/karthickai/16/base -> origin/gh/karthickai/16/base 2025-12-04T08:54:02.2727551Z * [new branch] gh/karthickai/16/head -> origin/gh/karthickai/16/head 2025-12-04T08:54:02.2727624Z * [new branch] gh/karthickai/16/orig -> origin/gh/karthickai/16/orig 2025-12-04T08:54:02.2727695Z * [new branch] gh/karthickai/17/base -> origin/gh/karthickai/17/base 2025-12-04T08:54:02.2727768Z * [new branch] gh/karthickai/17/head -> origin/gh/karthickai/17/head 2025-12-04T08:54:02.2727840Z * [new branch] gh/karthickai/17/orig -> origin/gh/karthickai/17/orig 2025-12-04T08:54:02.2727912Z * [new branch] gh/karthickai/18/base -> origin/gh/karthickai/18/base 2025-12-04T08:54:02.2727985Z * [new branch] gh/karthickai/18/head -> origin/gh/karthickai/18/head 2025-12-04T08:54:02.2728101Z * [new branch] gh/karthickai/18/orig -> origin/gh/karthickai/18/orig 2025-12-04T08:54:02.2728173Z * [new branch] gh/karthickai/19/base -> origin/gh/karthickai/19/base 2025-12-04T08:54:02.2728245Z * [new branch] gh/karthickai/19/head -> origin/gh/karthickai/19/head 2025-12-04T08:54:02.2728317Z * [new branch] gh/karthickai/19/orig -> origin/gh/karthickai/19/orig 2025-12-04T08:54:02.2728389Z * [new branch] gh/karthickai/20/base -> origin/gh/karthickai/20/base 2025-12-04T08:54:02.2728463Z * [new branch] gh/karthickai/20/head -> origin/gh/karthickai/20/head 2025-12-04T08:54:02.2728534Z * [new branch] gh/karthickai/20/orig -> origin/gh/karthickai/20/orig 2025-12-04T08:54:02.2728605Z * [new branch] gh/karthickai/21/base -> origin/gh/karthickai/21/base 2025-12-04T08:54:02.2728678Z * [new branch] gh/karthickai/21/head -> 
origin/gh/karthickai/21/head 2025-12-04T08:54:02.2728750Z * [new branch] gh/karthickai/21/orig -> origin/gh/karthickai/21/orig 2025-12-04T08:54:02.2728822Z * [new branch] gh/karthickai/22/base -> origin/gh/karthickai/22/base 2025-12-04T08:54:02.2728895Z * [new branch] gh/karthickai/22/head -> origin/gh/karthickai/22/head 2025-12-04T08:54:02.2728967Z * [new branch] gh/karthickai/22/orig -> origin/gh/karthickai/22/orig 2025-12-04T08:54:02.2729038Z * [new branch] gh/karthickai/23/base -> origin/gh/karthickai/23/base 2025-12-04T08:54:02.2729113Z * [new branch] gh/karthickai/23/head -> origin/gh/karthickai/23/head 2025-12-04T08:54:02.2729184Z * [new branch] gh/karthickai/23/orig -> origin/gh/karthickai/23/orig 2025-12-04T08:54:02.2729255Z * [new branch] gh/karthickai/24/base -> origin/gh/karthickai/24/base 2025-12-04T08:54:02.2729329Z * [new branch] gh/karthickai/24/head -> origin/gh/karthickai/24/head 2025-12-04T08:54:02.2729402Z * [new branch] gh/karthickai/24/orig -> origin/gh/karthickai/24/orig 2025-12-04T08:54:02.2729474Z * [new branch] gh/karthickai/25/base -> origin/gh/karthickai/25/base 2025-12-04T08:54:02.2729546Z * [new branch] gh/karthickai/25/head -> origin/gh/karthickai/25/head 2025-12-04T08:54:02.2729616Z * [new branch] gh/karthickai/25/orig -> origin/gh/karthickai/25/orig 2025-12-04T08:54:02.2729689Z * [new branch] gh/karthickai/26/base -> origin/gh/karthickai/26/base 2025-12-04T08:54:02.2729790Z * [new branch] gh/karthickai/26/head -> origin/gh/karthickai/26/head 2025-12-04T08:54:02.2729862Z * [new branch] gh/karthickai/26/orig -> origin/gh/karthickai/26/orig 2025-12-04T08:54:02.2729934Z * [new branch] gh/karthickai/6/base -> origin/gh/karthickai/6/base 2025-12-04T08:54:02.2730007Z * [new branch] gh/karthickai/6/head -> origin/gh/karthickai/6/head 2025-12-04T08:54:02.2730077Z * [new branch] gh/karthickai/6/orig -> origin/gh/karthickai/6/orig 2025-12-04T08:54:02.2730146Z * [new branch] gh/krocki/1/base -> origin/gh/krocki/1/base 
2025-12-04T08:54:02.2730212Z * [new branch] gh/krocki/1/head -> origin/gh/krocki/1/head 2025-12-04T08:54:02.2730277Z * [new branch] gh/krocki/1/orig -> origin/gh/krocki/1/orig 2025-12-04T08:54:02.2730343Z * [new branch] gh/krocki/2/base -> origin/gh/krocki/2/base 2025-12-04T08:54:02.2730410Z * [new branch] gh/krocki/2/head -> origin/gh/krocki/2/head 2025-12-04T08:54:02.2730474Z * [new branch] gh/krocki/2/orig -> origin/gh/krocki/2/orig 2025-12-04T08:54:02.2730554Z * [new branch] gh/kurtamohler/60/base -> origin/gh/kurtamohler/60/base 2025-12-04T08:54:02.2730655Z * [new branch] gh/kurtamohler/60/head -> origin/gh/kurtamohler/60/head 2025-12-04T08:54:02.2730730Z * [new branch] gh/kurtamohler/60/orig -> origin/gh/kurtamohler/60/orig 2025-12-04T08:54:02.2730805Z * [new branch] gh/kurtamohler/61/base -> origin/gh/kurtamohler/61/base 2025-12-04T08:54:02.2730878Z * [new branch] gh/kurtamohler/61/head -> origin/gh/kurtamohler/61/head 2025-12-04T08:54:02.2730950Z * [new branch] gh/kurtamohler/61/orig -> origin/gh/kurtamohler/61/orig 2025-12-04T08:54:02.2731024Z * [new branch] gh/kurtamohler/62/base -> origin/gh/kurtamohler/62/base 2025-12-04T08:54:02.2731097Z * [new branch] gh/kurtamohler/62/head -> origin/gh/kurtamohler/62/head 2025-12-04T08:54:02.2731170Z * [new branch] gh/kurtamohler/62/orig -> origin/gh/kurtamohler/62/orig 2025-12-04T08:54:02.2731242Z * [new branch] gh/kurtamohler/63/base -> origin/gh/kurtamohler/63/base 2025-12-04T08:54:02.2731316Z * [new branch] gh/kurtamohler/63/head -> origin/gh/kurtamohler/63/head 2025-12-04T08:54:02.2731390Z * [new branch] gh/kurtamohler/63/orig -> origin/gh/kurtamohler/63/orig 2025-12-04T08:54:02.2731462Z * [new branch] gh/kurtamohler/64/base -> origin/gh/kurtamohler/64/base 2025-12-04T08:54:02.2731535Z * [new branch] gh/kurtamohler/64/head -> origin/gh/kurtamohler/64/head 2025-12-04T08:54:02.2731608Z * [new branch] gh/kurtamohler/64/orig -> origin/gh/kurtamohler/64/orig 2025-12-04T08:54:02.2731680Z * [new branch] 
gh/kurtamohler/65/base -> origin/gh/kurtamohler/65/base 2025-12-04T08:54:02.2731753Z * [new branch] gh/kurtamohler/65/head -> origin/gh/kurtamohler/65/head 2025-12-04T08:54:02.2731829Z * [new branch] gh/kurtamohler/65/orig -> origin/gh/kurtamohler/65/orig 2025-12-04T08:54:02.2731901Z * [new branch] gh/kurtamohler/66/base -> origin/gh/kurtamohler/66/base 2025-12-04T08:54:02.2731975Z * [new branch] gh/kurtamohler/66/head -> origin/gh/kurtamohler/66/head 2025-12-04T08:54:02.2732049Z * [new branch] gh/kurtamohler/66/orig -> origin/gh/kurtamohler/66/orig 2025-12-04T08:54:02.2732122Z * [new branch] gh/kurtamohler/67/base -> origin/gh/kurtamohler/67/base 2025-12-04T08:54:02.2732194Z * [new branch] gh/kurtamohler/67/head -> origin/gh/kurtamohler/67/head 2025-12-04T08:54:02.2732268Z * [new branch] gh/kurtamohler/67/orig -> origin/gh/kurtamohler/67/orig 2025-12-04T08:54:02.2732381Z * [new branch] gh/kwen2501/130/base -> origin/gh/kwen2501/130/base 2025-12-04T08:54:02.2732450Z * [new branch] gh/kwen2501/130/head -> origin/gh/kwen2501/130/head 2025-12-04T08:54:02.2732519Z * [new branch] gh/kwen2501/130/orig -> origin/gh/kwen2501/130/orig 2025-12-04T08:54:02.2732587Z * [new branch] gh/kwen2501/170/base -> origin/gh/kwen2501/170/base 2025-12-04T08:54:02.2732657Z * [new branch] gh/kwen2501/170/head -> origin/gh/kwen2501/170/head 2025-12-04T08:54:02.2732725Z * [new branch] gh/kwen2501/187/base -> origin/gh/kwen2501/187/base 2025-12-04T08:54:02.2732792Z * [new branch] gh/kwen2501/187/head -> origin/gh/kwen2501/187/head 2025-12-04T08:54:02.2732860Z * [new branch] gh/kwen2501/187/orig -> origin/gh/kwen2501/187/orig 2025-12-04T08:54:02.2732927Z * [new branch] gh/kwen2501/188/base -> origin/gh/kwen2501/188/base 2025-12-04T08:54:02.2732996Z * [new branch] gh/kwen2501/188/head -> origin/gh/kwen2501/188/head 2025-12-04T08:54:02.2733064Z * [new branch] gh/kwen2501/188/orig -> origin/gh/kwen2501/188/orig 2025-12-04T08:54:02.2733131Z * [new branch] gh/kwen2501/211/base -> 
origin/gh/kwen2501/211/base 2025-12-04T08:54:02.2733198Z * [new branch] gh/kwen2501/211/head -> origin/gh/kwen2501/211/head 2025-12-04T08:54:02.2733295Z * [new branch] gh/kwen2501/224/base -> origin/gh/kwen2501/224/base 2025-12-04T08:54:02.2733363Z * [new branch] gh/kwen2501/224/head -> origin/gh/kwen2501/224/head 2025-12-04T08:54:02.2733430Z * [new branch] gh/kwen2501/224/orig -> origin/gh/kwen2501/224/orig 2025-12-04T08:54:02.2733498Z * [new branch] gh/kwen2501/228/base -> origin/gh/kwen2501/228/base 2025-12-04T08:54:02.2733565Z * [new branch] gh/kwen2501/228/head -> origin/gh/kwen2501/228/head 2025-12-04T08:54:02.2733633Z * [new branch] gh/kwen2501/228/orig -> origin/gh/kwen2501/228/orig 2025-12-04T08:54:02.2733702Z * [new branch] gh/kwen2501/234/base -> origin/gh/kwen2501/234/base 2025-12-04T08:54:02.2733769Z * [new branch] gh/kwen2501/234/head -> origin/gh/kwen2501/234/head 2025-12-04T08:54:02.2733837Z * [new branch] gh/kwen2501/234/orig -> origin/gh/kwen2501/234/orig 2025-12-04T08:54:02.2733905Z * [new branch] gh/kwen2501/235/base -> origin/gh/kwen2501/235/base 2025-12-04T08:54:02.2733972Z * [new branch] gh/kwen2501/235/head -> origin/gh/kwen2501/235/head 2025-12-04T08:54:02.2734039Z * [new branch] gh/kwen2501/235/orig -> origin/gh/kwen2501/235/orig 2025-12-04T08:54:02.2734107Z * [new branch] gh/kwen2501/236/base -> origin/gh/kwen2501/236/base 2025-12-04T08:54:02.2734174Z * [new branch] gh/kwen2501/236/head -> origin/gh/kwen2501/236/head 2025-12-04T08:54:02.2734244Z * [new branch] gh/kwen2501/236/orig -> origin/gh/kwen2501/236/orig 2025-12-04T08:54:02.2734310Z * [new branch] gh/kwen2501/237/base -> origin/gh/kwen2501/237/base 2025-12-04T08:54:02.2734377Z * [new branch] gh/kwen2501/237/head -> origin/gh/kwen2501/237/head 2025-12-04T08:54:02.2734448Z * [new branch] gh/kwen2501/237/orig -> origin/gh/kwen2501/237/orig 2025-12-04T08:54:02.2734516Z * [new branch] gh/kwen2501/238/base -> origin/gh/kwen2501/238/base 2025-12-04T08:54:02.2734584Z * [new branch] 
gh/kwen2501/238/head -> origin/gh/kwen2501/238/head 2025-12-04T08:54:02.2734652Z * [new branch] gh/kwen2501/238/orig -> origin/gh/kwen2501/238/orig 2025-12-04T08:54:02.2734719Z * [new branch] gh/kwen2501/240/base -> origin/gh/kwen2501/240/base 2025-12-04T08:54:02.2734786Z * [new branch] gh/kwen2501/240/head -> origin/gh/kwen2501/240/head 2025-12-04T08:54:02.2734882Z * [new branch] gh/kwen2501/240/orig -> origin/gh/kwen2501/240/orig 2025-12-04T08:54:02.2734950Z * [new branch] gh/kwen2501/241/base -> origin/gh/kwen2501/241/base 2025-12-04T08:54:02.2735018Z * [new branch] gh/kwen2501/241/head -> origin/gh/kwen2501/241/head 2025-12-04T08:54:02.2735088Z * [new branch] gh/kwen2501/241/orig -> origin/gh/kwen2501/241/orig 2025-12-04T08:54:02.2735156Z * [new branch] gh/kwen2501/247/base -> origin/gh/kwen2501/247/base 2025-12-04T08:54:02.2735223Z * [new branch] gh/kwen2501/247/head -> origin/gh/kwen2501/247/head 2025-12-04T08:54:02.2735293Z * [new branch] gh/kwen2501/247/orig -> origin/gh/kwen2501/247/orig 2025-12-04T08:54:02.2735360Z * [new branch] gh/kwen2501/252/base -> origin/gh/kwen2501/252/base 2025-12-04T08:54:02.2735427Z * [new branch] gh/kwen2501/252/head -> origin/gh/kwen2501/252/head 2025-12-04T08:54:02.2735497Z * [new branch] gh/kwen2501/252/orig -> origin/gh/kwen2501/252/orig 2025-12-04T08:54:02.2735564Z * [new branch] gh/kwen2501/259/base -> origin/gh/kwen2501/259/base 2025-12-04T08:54:02.2735633Z * [new branch] gh/kwen2501/259/head -> origin/gh/kwen2501/259/head 2025-12-04T08:54:02.2735733Z * [new branch] gh/kwen2501/259/orig -> origin/gh/kwen2501/259/orig 2025-12-04T08:54:02.2735800Z * [new branch] gh/kwen2501/260/base -> origin/gh/kwen2501/260/base 2025-12-04T08:54:02.2735869Z * [new branch] gh/kwen2501/260/head -> origin/gh/kwen2501/260/head 2025-12-04T08:54:02.2735982Z * [new branch] gh/kwen2501/260/orig -> origin/gh/kwen2501/260/orig 2025-12-04T08:54:02.2736050Z * [new branch] gh/kwen2501/268/base -> origin/gh/kwen2501/268/base 
2025-12-04T08:54:02.2736118Z * [new branch] gh/kwen2501/268/head -> origin/gh/kwen2501/268/head 2025-12-04T08:54:02.2736187Z * [new branch] gh/kwen2501/268/orig -> origin/gh/kwen2501/268/orig 2025-12-04T08:54:02.2736254Z * [new branch] gh/kwen2501/269/base -> origin/gh/kwen2501/269/base 2025-12-04T08:54:02.2736324Z * [new branch] gh/kwen2501/269/head -> origin/gh/kwen2501/269/head 2025-12-04T08:54:02.2736397Z * [new branch] gh/kwen2501/269/orig -> origin/gh/kwen2501/269/orig 2025-12-04T08:54:02.2736464Z * [new branch] gh/kwen2501/270/base -> origin/gh/kwen2501/270/base 2025-12-04T08:54:02.2736533Z * [new branch] gh/kwen2501/270/head -> origin/gh/kwen2501/270/head 2025-12-04T08:54:02.2736600Z * [new branch] gh/kwen2501/270/orig -> origin/gh/kwen2501/270/orig 2025-12-04T08:54:02.2736668Z * [new branch] gh/kwen2501/271/base -> origin/gh/kwen2501/271/base 2025-12-04T08:54:02.2736736Z * [new branch] gh/kwen2501/271/head -> origin/gh/kwen2501/271/head 2025-12-04T08:54:02.2736805Z * [new branch] gh/kwen2501/271/orig -> origin/gh/kwen2501/271/orig 2025-12-04T08:54:02.2736872Z * [new branch] gh/kwen2501/274/base -> origin/gh/kwen2501/274/base 2025-12-04T08:54:02.2736940Z * [new branch] gh/kwen2501/274/head -> origin/gh/kwen2501/274/head 2025-12-04T08:54:02.2737008Z * [new branch] gh/kwen2501/274/orig -> origin/gh/kwen2501/274/orig 2025-12-04T08:54:02.2737077Z * [new branch] gh/kwen2501/275/base -> origin/gh/kwen2501/275/base 2025-12-04T08:54:02.2737144Z * [new branch] gh/kwen2501/275/head -> origin/gh/kwen2501/275/head 2025-12-04T08:54:02.2737210Z * [new branch] gh/kwen2501/275/orig -> origin/gh/kwen2501/275/orig 2025-12-04T08:54:02.2737279Z * [new branch] gh/kwen2501/276/base -> origin/gh/kwen2501/276/base 2025-12-04T08:54:02.2737346Z * [new branch] gh/kwen2501/276/head -> origin/gh/kwen2501/276/head 2025-12-04T08:54:02.2737473Z * [new branch] gh/kwen2501/276/orig -> origin/gh/kwen2501/276/orig 2025-12-04T08:54:02.2737542Z * [new branch] gh/kwen2501/277/base -> 
origin/gh/kwen2501/277/base 2025-12-04T08:54:02.2737609Z * [new branch] gh/kwen2501/277/head -> origin/gh/kwen2501/277/head 2025-12-04T08:54:02.2737678Z * [new branch] gh/kwen2501/277/orig -> origin/gh/kwen2501/277/orig 2025-12-04T08:54:02.2737747Z * [new branch] gh/kwen2501/278/base -> origin/gh/kwen2501/278/base 2025-12-04T08:54:02.2737814Z * [new branch] gh/kwen2501/278/head -> origin/gh/kwen2501/278/head 2025-12-04T08:54:02.2737882Z * [new branch] gh/kwen2501/278/orig -> origin/gh/kwen2501/278/orig 2025-12-04T08:54:02.2737950Z * [new branch] gh/kwen2501/279/base -> origin/gh/kwen2501/279/base 2025-12-04T08:54:02.2738019Z * [new branch] gh/kwen2501/279/head -> origin/gh/kwen2501/279/head 2025-12-04T08:54:02.2738087Z * [new branch] gh/kwen2501/279/orig -> origin/gh/kwen2501/279/orig 2025-12-04T08:54:02.2738156Z * [new branch] gh/kwen2501/280/base -> origin/gh/kwen2501/280/base 2025-12-04T08:54:02.2738223Z * [new branch] gh/kwen2501/280/head -> origin/gh/kwen2501/280/head 2025-12-04T08:54:02.2738326Z * [new branch] gh/kwen2501/280/orig -> origin/gh/kwen2501/280/orig 2025-12-04T08:54:02.2738396Z * [new branch] gh/kwen2501/281/base -> origin/gh/kwen2501/281/base 2025-12-04T08:54:02.2738463Z * [new branch] gh/kwen2501/281/head -> origin/gh/kwen2501/281/head 2025-12-04T08:54:02.2738532Z * [new branch] gh/kwen2501/281/orig -> origin/gh/kwen2501/281/orig 2025-12-04T08:54:02.2738599Z * [new branch] gh/kwen2501/282/base -> origin/gh/kwen2501/282/base 2025-12-04T08:54:02.2738667Z * [new branch] gh/kwen2501/282/head -> origin/gh/kwen2501/282/head 2025-12-04T08:54:02.2738735Z * [new branch] gh/kwen2501/282/orig -> origin/gh/kwen2501/282/orig 2025-12-04T08:54:02.2738802Z * [new branch] gh/kwen2501/283/base -> origin/gh/kwen2501/283/base 2025-12-04T08:54:02.2738869Z * [new branch] gh/kwen2501/283/head -> origin/gh/kwen2501/283/head 2025-12-04T08:54:02.2738938Z * [new branch] gh/kwen2501/283/orig -> origin/gh/kwen2501/283/orig 2025-12-04T08:54:02.2739005Z * [new branch] 
gh/kwen2501/284/base -> origin/gh/kwen2501/284/base 2025-12-04T08:54:02.2739071Z * [new branch] gh/kwen2501/284/head -> origin/gh/kwen2501/284/head 2025-12-04T08:54:02.2739139Z * [new branch] gh/kwen2501/284/orig -> origin/gh/kwen2501/284/orig 2025-12-04T08:54:02.2739207Z * [new branch] gh/kwen2501/285/base -> origin/gh/kwen2501/285/base 2025-12-04T08:54:02.2739276Z * [new branch] gh/kwen2501/285/head -> origin/gh/kwen2501/285/head 2025-12-04T08:54:02.2739346Z * [new branch] gh/kwen2501/285/orig -> origin/gh/kwen2501/285/orig 2025-12-04T08:54:02.2739413Z * [new branch] gh/kwen2501/286/base -> origin/gh/kwen2501/286/base 2025-12-04T08:54:02.2739480Z * [new branch] gh/kwen2501/286/head -> origin/gh/kwen2501/286/head 2025-12-04T08:54:02.2739549Z * [new branch] gh/kwen2501/286/orig -> origin/gh/kwen2501/286/orig 2025-12-04T08:54:02.2739617Z * [new branch] gh/kwen2501/287/base -> origin/gh/kwen2501/287/base 2025-12-04T08:54:02.2739684Z * [new branch] gh/kwen2501/287/head -> origin/gh/kwen2501/287/head 2025-12-04T08:54:02.2739752Z * [new branch] gh/kwen2501/287/orig -> origin/gh/kwen2501/287/orig 2025-12-04T08:54:02.2739820Z * [new branch] gh/kwen2501/288/base -> origin/gh/kwen2501/288/base 2025-12-04T08:54:02.2739913Z * [new branch] gh/kwen2501/288/head -> origin/gh/kwen2501/288/head 2025-12-04T08:54:02.2739981Z * [new branch] gh/kwen2501/288/orig -> origin/gh/kwen2501/288/orig 2025-12-04T08:54:02.2740055Z * [new branch] gh/laithsakka/251/base -> origin/gh/laithsakka/251/base 2025-12-04T08:54:02.2740131Z * [new branch] gh/laithsakka/251/head -> origin/gh/laithsakka/251/head 2025-12-04T08:54:02.2740204Z * [new branch] gh/laithsakka/251/orig -> origin/gh/laithsakka/251/orig 2025-12-04T08:54:02.2740276Z * [new branch] gh/laithsakka/276/base -> origin/gh/laithsakka/276/base 2025-12-04T08:54:02.2740349Z * [new branch] gh/laithsakka/276/head -> origin/gh/laithsakka/276/head 2025-12-04T08:54:02.2740420Z * [new branch] gh/laithsakka/276/orig -> origin/gh/laithsakka/276/orig 
2025-12-04T08:54:02.2740493Z * [new branch] gh/laithsakka/28/base -> origin/gh/laithsakka/28/base 2025-12-04T08:54:02.2740569Z * [new branch] gh/laithsakka/29/base -> origin/gh/laithsakka/29/base 2025-12-04T08:54:02.2740641Z * [new branch] gh/laithsakka/30/base -> origin/gh/laithsakka/30/base 2025-12-04T08:54:02.2740713Z * [new branch] gh/laithsakka/30/head -> origin/gh/laithsakka/30/head 2025-12-04T08:54:02.2740812Z * [new branch] gh/laithsakka/31/base -> origin/gh/laithsakka/31/base 2025-12-04T08:54:02.2740884Z * [new branch] gh/laithsakka/31/head -> origin/gh/laithsakka/31/head 2025-12-04T08:54:02.2740956Z * [new branch] gh/laithsakka/313/base -> origin/gh/laithsakka/313/base 2025-12-04T08:54:02.2741031Z * [new branch] gh/laithsakka/313/head -> origin/gh/laithsakka/313/head 2025-12-04T08:54:02.2741104Z * [new branch] gh/laithsakka/313/orig -> origin/gh/laithsakka/313/orig 2025-12-04T08:54:02.2741175Z * [new branch] gh/laithsakka/316/base -> origin/gh/laithsakka/316/base 2025-12-04T08:54:02.2741249Z * [new branch] gh/laithsakka/316/head -> origin/gh/laithsakka/316/head 2025-12-04T08:54:02.2741321Z * [new branch] gh/laithsakka/316/orig -> origin/gh/laithsakka/316/orig 2025-12-04T08:54:02.2741392Z * [new branch] gh/laithsakka/317/base -> origin/gh/laithsakka/317/base 2025-12-04T08:54:02.2741465Z * [new branch] gh/laithsakka/317/head -> origin/gh/laithsakka/317/head 2025-12-04T08:54:02.2741537Z * [new branch] gh/laithsakka/317/orig -> origin/gh/laithsakka/317/orig 2025-12-04T08:54:02.2741610Z * [new branch] gh/laithsakka/319/base -> origin/gh/laithsakka/319/base 2025-12-04T08:54:02.2741681Z * [new branch] gh/laithsakka/319/head -> origin/gh/laithsakka/319/head 2025-12-04T08:54:02.2741752Z * [new branch] gh/laithsakka/319/orig -> origin/gh/laithsakka/319/orig 2025-12-04T08:54:02.2741827Z * [new branch] gh/laithsakka/32/base -> origin/gh/laithsakka/32/base 2025-12-04T08:54:02.2741901Z * [new branch] gh/laithsakka/32/head -> origin/gh/laithsakka/32/head 
2025-12-04T08:54:02.2741973Z * [new branch] gh/laithsakka/320/base -> origin/gh/laithsakka/320/base 2025-12-04T08:54:02.2742045Z * [new branch] gh/laithsakka/320/head -> origin/gh/laithsakka/320/head 2025-12-04T08:54:02.2742119Z * [new branch] gh/laithsakka/320/orig -> origin/gh/laithsakka/320/orig 2025-12-04T08:54:02.2742190Z * [new branch] gh/laithsakka/321/base -> origin/gh/laithsakka/321/base 2025-12-04T08:54:02.2742262Z * [new branch] gh/laithsakka/321/head -> origin/gh/laithsakka/321/head 2025-12-04T08:54:02.2742334Z * [new branch] gh/laithsakka/321/orig -> origin/gh/laithsakka/321/orig 2025-12-04T08:54:02.2742406Z * [new branch] gh/laithsakka/322/base -> origin/gh/laithsakka/322/base 2025-12-04T08:54:02.2742509Z * [new branch] gh/laithsakka/322/head -> origin/gh/laithsakka/322/head 2025-12-04T08:54:02.2742580Z * [new branch] gh/laithsakka/322/orig -> origin/gh/laithsakka/322/orig 2025-12-04T08:54:02.2742651Z * [new branch] gh/laithsakka/323/base -> origin/gh/laithsakka/323/base 2025-12-04T08:54:02.2742723Z * [new branch] gh/laithsakka/323/head -> origin/gh/laithsakka/323/head 2025-12-04T08:54:02.2742795Z * [new branch] gh/laithsakka/323/orig -> origin/gh/laithsakka/323/orig 2025-12-04T08:54:02.2742867Z * [new branch] gh/laithsakka/324/base -> origin/gh/laithsakka/324/base 2025-12-04T08:54:02.2742940Z * [new branch] gh/laithsakka/324/head -> origin/gh/laithsakka/324/head 2025-12-04T08:54:02.2743011Z * [new branch] gh/laithsakka/324/orig -> origin/gh/laithsakka/324/orig 2025-12-04T08:54:02.2743083Z * [new branch] gh/laithsakka/325/base -> origin/gh/laithsakka/325/base 2025-12-04T08:54:02.2743156Z * [new branch] gh/laithsakka/325/head -> origin/gh/laithsakka/325/head 2025-12-04T08:54:02.2743228Z * [new branch] gh/laithsakka/325/orig -> origin/gh/laithsakka/325/orig 2025-12-04T08:54:02.2743300Z * [new branch] gh/laithsakka/326/base -> origin/gh/laithsakka/326/base 2025-12-04T08:54:02.2743399Z * [new branch] gh/laithsakka/326/head -> origin/gh/laithsakka/326/head 
2025-12-04T08:54:02.2743470Z * [new branch] gh/laithsakka/326/orig -> origin/gh/laithsakka/326/orig 2025-12-04T08:54:02.2743542Z * [new branch] gh/laithsakka/327/base -> origin/gh/laithsakka/327/base 2025-12-04T08:54:02.2743613Z * [new branch] gh/laithsakka/327/head -> origin/gh/laithsakka/327/head 2025-12-04T08:54:02.2743685Z * [new branch] gh/laithsakka/327/orig -> origin/gh/laithsakka/327/orig 2025-12-04T08:54:02.2743758Z * [new branch] gh/laithsakka/328/base -> origin/gh/laithsakka/328/base 2025-12-04T08:54:02.2743831Z * [new branch] gh/laithsakka/328/head -> origin/gh/laithsakka/328/head 2025-12-04T08:54:02.2743903Z * [new branch] gh/laithsakka/328/orig -> origin/gh/laithsakka/328/orig 2025-12-04T08:54:02.2743972Z * [new branch] gh/liangel/4/base -> origin/gh/liangel/4/base 2025-12-04T08:54:02.2744043Z * [new branch] gh/liangel/4/head -> origin/gh/liangel/4/head 2025-12-04T08:54:02.2744111Z * [new branch] gh/liangel/4/orig -> origin/gh/liangel/4/orig 2025-12-04T08:54:02.2744188Z * [new branch] gh/lucaskabela/1/base -> origin/gh/lucaskabela/1/base 2025-12-04T08:54:02.2744262Z * [new branch] gh/lucaskabela/1/head -> origin/gh/lucaskabela/1/head 2025-12-04T08:54:02.2744325Z * [new branch] gh/lw/4/base -> origin/gh/lw/4/base 2025-12-04T08:54:02.2744388Z * [new branch] gh/lw/4/head -> origin/gh/lw/4/head 2025-12-04T08:54:02.2744450Z * [new branch] gh/lw/4/orig -> origin/gh/lw/4/orig 2025-12-04T08:54:02.2744509Z * [new branch] gh/lw/5/base -> origin/gh/lw/5/base 2025-12-04T08:54:02.2744569Z * [new branch] gh/lw/5/head -> origin/gh/lw/5/head 2025-12-04T08:54:02.2744631Z * [new branch] gh/lw/5/orig -> origin/gh/lw/5/orig 2025-12-04T08:54:02.2744691Z * [new branch] gh/lw/6/base -> origin/gh/lw/6/base 2025-12-04T08:54:02.2744750Z * [new branch] gh/lw/6/head -> origin/gh/lw/6/head 2025-12-04T08:54:02.2744809Z * [new branch] gh/lw/6/orig -> origin/gh/lw/6/orig 2025-12-04T08:54:02.2744877Z * [new branch] gh/malfet/14/base -> origin/gh/malfet/14/base 
2025-12-04T08:54:02.2744947Z * [new branch] gh/malfet/417/base -> origin/gh/malfet/417/base 2025-12-04T08:54:02.2745048Z * [new branch] gh/malfet/417/head -> origin/gh/malfet/417/head 2025-12-04T08:54:02.2745116Z * [new branch] gh/malfet/417/orig -> origin/gh/malfet/417/orig 2025-12-04T08:54:02.2745184Z * [new branch] gh/malfet/506/base -> origin/gh/malfet/506/base 2025-12-04T08:54:02.2745249Z * [new branch] gh/malfet/506/head -> origin/gh/malfet/506/head 2025-12-04T08:54:02.2745319Z * [new branch] gh/malfet/506/orig -> origin/gh/malfet/506/orig 2025-12-04T08:54:02.2745385Z * [new branch] gh/malfet/517/base -> origin/gh/malfet/517/base 2025-12-04T08:54:02.2745450Z * [new branch] gh/malfet/517/head -> origin/gh/malfet/517/head 2025-12-04T08:54:02.2745517Z * [new branch] gh/malfet/528/base -> origin/gh/malfet/528/base 2025-12-04T08:54:02.2745583Z * [new branch] gh/malfet/528/head -> origin/gh/malfet/528/head 2025-12-04T08:54:02.2745650Z * [new branch] gh/malfet/528/orig -> origin/gh/malfet/528/orig 2025-12-04T08:54:02.2745716Z * [new branch] gh/malfet/537/base -> origin/gh/malfet/537/base 2025-12-04T08:54:02.2745782Z * [new branch] gh/malfet/537/head -> origin/gh/malfet/537/head 2025-12-04T08:54:02.2745847Z * [new branch] gh/malfet/537/orig -> origin/gh/malfet/537/orig 2025-12-04T08:54:02.2745981Z * [new branch] gh/malfet/546/base -> origin/gh/malfet/546/base 2025-12-04T08:54:02.2746047Z * [new branch] gh/malfet/546/head -> origin/gh/malfet/546/head 2025-12-04T08:54:02.2746112Z * [new branch] gh/malfet/546/orig -> origin/gh/malfet/546/orig 2025-12-04T08:54:02.2746179Z * [new branch] gh/malfet/565/base -> origin/gh/malfet/565/base 2025-12-04T08:54:02.2746244Z * [new branch] gh/malfet/565/head -> origin/gh/malfet/565/head 2025-12-04T08:54:02.2746314Z * [new branch] gh/malfet/565/orig -> origin/gh/malfet/565/orig 2025-12-04T08:54:02.2746380Z * [new branch] gh/malfet/575/base -> origin/gh/malfet/575/base 2025-12-04T08:54:02.2746446Z * [new branch] gh/malfet/575/head -> 
origin/gh/malfet/575/head 2025-12-04T08:54:02.2746513Z * [new branch] gh/malfet/575/orig -> origin/gh/malfet/575/orig 2025-12-04T08:54:02.2746580Z * [new branch] gh/malfet/580/base -> origin/gh/malfet/580/base 2025-12-04T08:54:02.2746646Z * [new branch] gh/malfet/580/head -> origin/gh/malfet/580/head 2025-12-04T08:54:02.2746713Z * [new branch] gh/malfet/580/orig -> origin/gh/malfet/580/orig 2025-12-04T08:54:02.2746779Z * [new branch] gh/malfet/581/base -> origin/gh/malfet/581/base 2025-12-04T08:54:02.2746845Z * [new branch] gh/malfet/581/head -> origin/gh/malfet/581/head 2025-12-04T08:54:02.2746913Z * [new branch] gh/malfet/581/orig -> origin/gh/malfet/581/orig 2025-12-04T08:54:02.2746979Z * [new branch] gh/malfet/583/base -> origin/gh/malfet/583/base 2025-12-04T08:54:02.2747044Z * [new branch] gh/malfet/583/head -> origin/gh/malfet/583/head 2025-12-04T08:54:02.2747111Z * [new branch] gh/malfet/583/orig -> origin/gh/malfet/583/orig 2025-12-04T08:54:02.2747178Z * [new branch] gh/malfet/586/base -> origin/gh/malfet/586/base 2025-12-04T08:54:02.2747243Z * [new branch] gh/malfet/586/head -> origin/gh/malfet/586/head 2025-12-04T08:54:02.2747310Z * [new branch] gh/malfet/586/orig -> origin/gh/malfet/586/orig 2025-12-04T08:54:02.2747375Z * [new branch] gh/malfet/587/base -> origin/gh/malfet/587/base 2025-12-04T08:54:02.2747440Z * [new branch] gh/malfet/587/head -> origin/gh/malfet/587/head 2025-12-04T08:54:02.2747557Z * [new branch] gh/malfet/587/orig -> origin/gh/malfet/587/orig 2025-12-04T08:54:02.2747622Z * [new branch] gh/malfet/588/base -> origin/gh/malfet/588/base 2025-12-04T08:54:02.2747688Z * [new branch] gh/malfet/588/head -> origin/gh/malfet/588/head 2025-12-04T08:54:02.2747755Z * [new branch] gh/malfet/588/orig -> origin/gh/malfet/588/orig 2025-12-04T08:54:02.2747821Z * [new branch] gh/malfet/589/base -> origin/gh/malfet/589/base 2025-12-04T08:54:02.2747888Z * [new branch] gh/malfet/589/head -> origin/gh/malfet/589/head 2025-12-04T08:54:02.2747956Z * [new 
branch] gh/malfet/589/orig -> origin/gh/malfet/589/orig 2025-12-04T08:54:02.2748022Z * [new branch] gh/malfet/590/base -> origin/gh/malfet/590/base 2025-12-04T08:54:02.2748090Z * [new branch] gh/malfet/590/head -> origin/gh/malfet/590/head 2025-12-04T08:54:02.2748155Z * [new branch] gh/malfet/590/orig -> origin/gh/malfet/590/orig 2025-12-04T08:54:02.2748222Z * [new branch] gh/malfet/591/base -> origin/gh/malfet/591/base 2025-12-04T08:54:02.2748289Z * [new branch] gh/malfet/591/head -> origin/gh/malfet/591/head 2025-12-04T08:54:02.2748354Z * [new branch] gh/malfet/591/orig -> origin/gh/malfet/591/orig 2025-12-04T08:54:02.2748456Z * [new branch] gh/malfet/592/base -> origin/gh/malfet/592/base 2025-12-04T08:54:02.2748523Z * [new branch] gh/malfet/592/head -> origin/gh/malfet/592/head 2025-12-04T08:54:02.2748588Z * [new branch] gh/malfet/592/orig -> origin/gh/malfet/592/orig 2025-12-04T08:54:02.2748654Z * [new branch] gh/malfet/593/base -> origin/gh/malfet/593/base 2025-12-04T08:54:02.2748720Z * [new branch] gh/malfet/593/head -> origin/gh/malfet/593/head 2025-12-04T08:54:02.2748785Z * [new branch] gh/malfet/593/orig -> origin/gh/malfet/593/orig 2025-12-04T08:54:02.2748851Z * [new branch] gh/malfet/594/base -> origin/gh/malfet/594/base 2025-12-04T08:54:02.2748920Z * [new branch] gh/malfet/594/head -> origin/gh/malfet/594/head 2025-12-04T08:54:02.2748986Z * [new branch] gh/malfet/594/orig -> origin/gh/malfet/594/orig 2025-12-04T08:54:02.2749052Z * [new branch] gh/malfet/595/base -> origin/gh/malfet/595/base 2025-12-04T08:54:02.2749118Z * [new branch] gh/malfet/595/head -> origin/gh/malfet/595/head 2025-12-04T08:54:02.2749184Z * [new branch] gh/malfet/595/orig -> origin/gh/malfet/595/orig 2025-12-04T08:54:02.2749249Z * [new branch] gh/malfet/596/base -> origin/gh/malfet/596/base 2025-12-04T08:54:02.2749316Z * [new branch] gh/malfet/596/head -> origin/gh/malfet/596/head 2025-12-04T08:54:02.2749382Z * [new branch] gh/malfet/596/orig -> origin/gh/malfet/596/orig 
2025-12-04T08:54:02.2749449Z * [new branch] gh/malfet/597/base -> origin/gh/malfet/597/base 2025-12-04T08:54:02.2749516Z * [new branch] gh/malfet/597/head -> origin/gh/malfet/597/head 2025-12-04T08:54:02.2749581Z * [new branch] gh/malfet/597/orig -> origin/gh/malfet/597/orig 2025-12-04T08:54:02.2749650Z * [new branch] gh/malfet/598/base -> origin/gh/malfet/598/base 2025-12-04T08:54:02.2749716Z * [new branch] gh/malfet/598/head -> origin/gh/malfet/598/head 2025-12-04T08:54:02.2749782Z * [new branch] gh/malfet/598/orig -> origin/gh/malfet/598/orig 2025-12-04T08:54:02.2749847Z * [new branch] gh/malfet/599/base -> origin/gh/malfet/599/base 2025-12-04T08:54:02.2749912Z * [new branch] gh/malfet/599/head -> origin/gh/malfet/599/head 2025-12-04T08:54:02.2749977Z * [new branch] gh/malfet/599/orig -> origin/gh/malfet/599/orig 2025-12-04T08:54:02.2750077Z * [new branch] gh/malfet/600/base -> origin/gh/malfet/600/base 2025-12-04T08:54:02.2750143Z * [new branch] gh/malfet/600/head -> origin/gh/malfet/600/head 2025-12-04T08:54:02.2750208Z * [new branch] gh/malfet/600/orig -> origin/gh/malfet/600/orig 2025-12-04T08:54:02.2750277Z * [new branch] gh/malfet/601/base -> origin/gh/malfet/601/base 2025-12-04T08:54:02.2750343Z * [new branch] gh/malfet/601/head -> origin/gh/malfet/601/head 2025-12-04T08:54:02.2750408Z * [new branch] gh/malfet/601/orig -> origin/gh/malfet/601/orig 2025-12-04T08:54:02.2750475Z * [new branch] gh/malfet/602/base -> origin/gh/malfet/602/base 2025-12-04T08:54:02.2750541Z * [new branch] gh/malfet/602/head -> origin/gh/malfet/602/head 2025-12-04T08:54:02.2750606Z * [new branch] gh/malfet/602/orig -> origin/gh/malfet/602/orig 2025-12-04T08:54:02.2750675Z * [new branch] gh/malfet/603/base -> origin/gh/malfet/603/base 2025-12-04T08:54:02.2750740Z * [new branch] gh/malfet/603/head -> origin/gh/malfet/603/head 2025-12-04T08:54:02.2750806Z * [new branch] gh/malfet/603/orig -> origin/gh/malfet/603/orig 2025-12-04T08:54:02.2750907Z * [new branch] gh/malfet/604/base -> 
origin/gh/malfet/604/base 2025-12-04T08:54:02.2750973Z * [new branch] gh/malfet/604/head -> origin/gh/malfet/604/head 2025-12-04T08:54:02.2751039Z * [new branch] gh/malfet/604/orig -> origin/gh/malfet/604/orig 2025-12-04T08:54:02.2751106Z * [new branch] gh/malfet/605/base -> origin/gh/malfet/605/base 2025-12-04T08:54:02.2751171Z * [new branch] gh/malfet/605/head -> origin/gh/malfet/605/head 2025-12-04T08:54:02.2751236Z * [new branch] gh/malfet/605/orig -> origin/gh/malfet/605/orig 2025-12-04T08:54:02.2751304Z * [new branch] gh/malfet/606/base -> origin/gh/malfet/606/base 2025-12-04T08:54:02.2751369Z * [new branch] gh/malfet/606/head -> origin/gh/malfet/606/head 2025-12-04T08:54:02.2751435Z * [new branch] gh/malfet/606/orig -> origin/gh/malfet/606/orig 2025-12-04T08:54:02.2751503Z * [new branch] gh/malfet/607/base -> origin/gh/malfet/607/base 2025-12-04T08:54:02.2751569Z * [new branch] gh/malfet/607/head -> origin/gh/malfet/607/head 2025-12-04T08:54:02.2751635Z * [new branch] gh/malfet/607/orig -> origin/gh/malfet/607/orig 2025-12-04T08:54:02.2751701Z * [new branch] gh/malfet/608/base -> origin/gh/malfet/608/base 2025-12-04T08:54:02.2751766Z * [new branch] gh/malfet/608/head -> origin/gh/malfet/608/head 2025-12-04T08:54:02.2751833Z * [new branch] gh/malfet/608/orig -> origin/gh/malfet/608/orig 2025-12-04T08:54:02.2751899Z * [new branch] gh/malfet/609/base -> origin/gh/malfet/609/base 2025-12-04T08:54:02.2751965Z * [new branch] gh/malfet/609/head -> origin/gh/malfet/609/head 2025-12-04T08:54:02.2752031Z * [new branch] gh/malfet/609/orig -> origin/gh/malfet/609/orig 2025-12-04T08:54:02.2752097Z * [new branch] gh/malfet/610/base -> origin/gh/malfet/610/base 2025-12-04T08:54:02.2752163Z * [new branch] gh/malfet/610/head -> origin/gh/malfet/610/head 2025-12-04T08:54:02.2752231Z * [new branch] gh/malfet/610/orig -> origin/gh/malfet/610/orig 2025-12-04T08:54:02.2752296Z * [new branch] gh/malfet/611/base -> origin/gh/malfet/611/base 2025-12-04T08:54:02.2752361Z * [new 
branch] gh/malfet/611/head -> origin/gh/malfet/611/head 2025-12-04T08:54:02.2752428Z * [new branch] gh/malfet/611/orig -> origin/gh/malfet/611/orig 2025-12-04T08:54:02.2752528Z * [new branch] gh/malfet/612/base -> origin/gh/malfet/612/base 2025-12-04T08:54:02.2752593Z * [new branch] gh/malfet/612/head -> origin/gh/malfet/612/head 2025-12-04T08:54:02.2752660Z * [new branch] gh/malfet/612/orig -> origin/gh/malfet/612/orig 2025-12-04T08:54:02.2752729Z * [new branch] gh/malfet/64/base -> origin/gh/malfet/64/base 2025-12-04T08:54:02.2752795Z * [new branch] gh/malfet/64/head -> origin/gh/malfet/64/head 2025-12-04T08:54:02.2752884Z * [new branch] gh/manuelcandales/11/base -> origin/gh/manuelcandales/11/base 2025-12-04T08:54:02.2752969Z * [new branch] gh/manuelcandales/11/head -> origin/gh/manuelcandales/11/head 2025-12-04T08:54:02.2753053Z * [new branch] gh/manuelcandales/11/orig -> origin/gh/manuelcandales/11/orig 2025-12-04T08:54:02.2753122Z * [new branch] gh/markkm/1/base -> origin/gh/markkm/1/base 2025-12-04T08:54:02.2753194Z * [new branch] gh/masnesral/1/base -> origin/gh/masnesral/1/base 2025-12-04T08:54:02.2753266Z * [new branch] gh/masnesral/1/head -> origin/gh/masnesral/1/head 2025-12-04T08:54:02.2753335Z * [new branch] gh/masnesral/1/orig -> origin/gh/masnesral/1/orig 2025-12-04T08:54:02.2753428Z * [new branch] gh/mhorowitz/0/base -> origin/gh/mhorowitz/0/base 2025-12-04T08:54:02.2753498Z * [new branch] gh/mhorowitz/0/head -> origin/gh/mhorowitz/0/head 2025-12-04T08:54:02.2753567Z * [new branch] gh/mhorowitz/1/base -> origin/gh/mhorowitz/1/base 2025-12-04T08:54:02.2753636Z * [new branch] gh/mhorowitz/1/head -> origin/gh/mhorowitz/1/head 2025-12-04T08:54:02.2753705Z * [new branch] gh/mhorowitz/2/base -> origin/gh/mhorowitz/2/base 2025-12-04T08:54:02.2753773Z * [new branch] gh/mhorowitz/2/head -> origin/gh/mhorowitz/2/head 2025-12-04T08:54:02.2753845Z * [new branch] gh/mhorowitz/3/base -> origin/gh/mhorowitz/3/base 2025-12-04T08:54:02.2753914Z * [new branch] 
gh/mhorowitz/3/head -> origin/gh/mhorowitz/3/head 2025-12-04T08:54:02.2753981Z * [new branch] gh/mhorowitz/4/base -> origin/gh/mhorowitz/4/base 2025-12-04T08:54:02.2754051Z * [new branch] gh/mhorowitz/4/head -> origin/gh/mhorowitz/4/head 2025-12-04T08:54:02.2754120Z * [new branch] gh/mhorowitz/5/base -> origin/gh/mhorowitz/5/base 2025-12-04T08:54:02.2754189Z * [new branch] gh/mhorowitz/5/head -> origin/gh/mhorowitz/5/head 2025-12-04T08:54:02.2754258Z * [new branch] gh/mhorowitz/6/base -> origin/gh/mhorowitz/6/base 2025-12-04T08:54:02.2754327Z * [new branch] gh/mhorowitz/6/head -> origin/gh/mhorowitz/6/head 2025-12-04T08:54:02.2754427Z * [new branch] gh/mikaylagawarecki/234/base -> origin/gh/mikaylagawarecki/234/base 2025-12-04T08:54:02.2754523Z * [new branch] gh/mikaylagawarecki/234/head -> origin/gh/mikaylagawarecki/234/head 2025-12-04T08:54:02.2754615Z * [new branch] gh/mikaylagawarecki/235/base -> origin/gh/mikaylagawarecki/235/base 2025-12-04T08:54:02.2754706Z * [new branch] gh/mikaylagawarecki/235/head -> origin/gh/mikaylagawarecki/235/head 2025-12-04T08:54:02.2754798Z * [new branch] gh/mikaylagawarecki/236/base -> origin/gh/mikaylagawarecki/236/base 2025-12-04T08:54:02.2754887Z * [new branch] gh/mikaylagawarecki/236/head -> origin/gh/mikaylagawarecki/236/head 2025-12-04T08:54:02.2754976Z * [new branch] gh/mikaylagawarecki/237/base -> origin/gh/mikaylagawarecki/237/base 2025-12-04T08:54:02.2755066Z * [new branch] gh/mikaylagawarecki/237/head -> origin/gh/mikaylagawarecki/237/head 2025-12-04T08:54:02.2755156Z * [new branch] gh/mikaylagawarecki/238/base -> origin/gh/mikaylagawarecki/238/base 2025-12-04T08:54:02.2755274Z * [new branch] gh/mikaylagawarecki/238/head -> origin/gh/mikaylagawarecki/238/head 2025-12-04T08:54:02.2755365Z * [new branch] gh/mikaylagawarecki/336/base -> origin/gh/mikaylagawarecki/336/base 2025-12-04T08:54:02.2755454Z * [new branch] gh/mikaylagawarecki/336/head -> origin/gh/mikaylagawarecki/336/head 2025-12-04T08:54:02.2755545Z * [new 
branch] gh/mikaylagawarecki/336/orig -> origin/gh/mikaylagawarecki/336/orig 2025-12-04T08:54:02.2755636Z * [new branch] gh/mikaylagawarecki/341/base -> origin/gh/mikaylagawarecki/341/base 2025-12-04T08:54:02.2755725Z * [new branch] gh/mikaylagawarecki/341/head -> origin/gh/mikaylagawarecki/341/head 2025-12-04T08:54:02.2755815Z * [new branch] gh/mikaylagawarecki/341/orig -> origin/gh/mikaylagawarecki/341/orig 2025-12-04T08:54:02.2755906Z * [new branch] gh/mikaylagawarecki/342/base -> origin/gh/mikaylagawarecki/342/base 2025-12-04T08:54:02.2756047Z * [new branch] gh/mikaylagawarecki/342/head -> origin/gh/mikaylagawarecki/342/head 2025-12-04T08:54:02.2756138Z * [new branch] gh/mikaylagawarecki/342/orig -> origin/gh/mikaylagawarecki/342/orig 2025-12-04T08:54:02.2756228Z * [new branch] gh/mikaylagawarecki/345/base -> origin/gh/mikaylagawarecki/345/base 2025-12-04T08:54:02.2756363Z * [new branch] gh/mikaylagawarecki/345/head -> origin/gh/mikaylagawarecki/345/head 2025-12-04T08:54:02.2756454Z * [new branch] gh/mikaylagawarecki/345/orig -> origin/gh/mikaylagawarecki/345/orig 2025-12-04T08:54:02.2756543Z * [new branch] gh/mikaylagawarecki/346/base -> origin/gh/mikaylagawarecki/346/base 2025-12-04T08:54:02.2756633Z * [new branch] gh/mikaylagawarecki/346/head -> origin/gh/mikaylagawarecki/346/head 2025-12-04T08:54:02.2756722Z * [new branch] gh/mikaylagawarecki/346/orig -> origin/gh/mikaylagawarecki/346/orig 2025-12-04T08:54:02.2756814Z * [new branch] gh/mikaylagawarecki/347/base -> origin/gh/mikaylagawarecki/347/base 2025-12-04T08:54:02.2756904Z * [new branch] gh/mikaylagawarecki/347/head -> origin/gh/mikaylagawarecki/347/head 2025-12-04T08:54:02.2756995Z * [new branch] gh/mikaylagawarecki/347/orig -> origin/gh/mikaylagawarecki/347/orig 2025-12-04T08:54:02.2757086Z * [new branch] gh/mikaylagawarecki/350/base -> origin/gh/mikaylagawarecki/350/base 2025-12-04T08:54:02.2757176Z * [new branch] gh/mikaylagawarecki/350/head -> origin/gh/mikaylagawarecki/350/head 
2025-12-04T08:54:02.2757267Z * [new branch] gh/mikaylagawarecki/350/orig -> origin/gh/mikaylagawarecki/350/orig 2025-12-04T08:54:02.2757356Z * [new branch] gh/mikaylagawarecki/351/base -> origin/gh/mikaylagawarecki/351/base 2025-12-04T08:54:02.2757447Z * [new branch] gh/mikaylagawarecki/351/head -> origin/gh/mikaylagawarecki/351/head 2025-12-04T08:54:02.2757538Z * [new branch] gh/mikaylagawarecki/351/orig -> origin/gh/mikaylagawarecki/351/orig 2025-12-04T08:54:02.2757627Z * [new branch] gh/mikaylagawarecki/352/base -> origin/gh/mikaylagawarecki/352/base 2025-12-04T08:54:02.2757717Z * [new branch] gh/mikaylagawarecki/352/head -> origin/gh/mikaylagawarecki/352/head 2025-12-04T08:54:02.2757807Z * [new branch] gh/mikaylagawarecki/352/orig -> origin/gh/mikaylagawarecki/352/orig 2025-12-04T08:54:02.2757896Z * [new branch] gh/mikaylagawarecki/353/base -> origin/gh/mikaylagawarecki/353/base 2025-12-04T08:54:02.2757987Z * [new branch] gh/mikaylagawarecki/353/head -> origin/gh/mikaylagawarecki/353/head 2025-12-04T08:54:02.2758075Z * [new branch] gh/mikaylagawarecki/353/orig -> origin/gh/mikaylagawarecki/353/orig 2025-12-04T08:54:02.2758164Z * [new branch] gh/mikaylagawarecki/354/base -> origin/gh/mikaylagawarecki/354/base 2025-12-04T08:54:02.2758301Z * [new branch] gh/mikaylagawarecki/354/head -> origin/gh/mikaylagawarecki/354/head 2025-12-04T08:54:02.2758390Z * [new branch] gh/mikaylagawarecki/354/orig -> origin/gh/mikaylagawarecki/354/orig 2025-12-04T08:54:02.2758480Z * [new branch] gh/mikaylagawarecki/356/base -> origin/gh/mikaylagawarecki/356/base 2025-12-04T08:54:02.2758571Z * [new branch] gh/mikaylagawarecki/356/head -> origin/gh/mikaylagawarecki/356/head 2025-12-04T08:54:02.2758660Z * [new branch] gh/mikaylagawarecki/356/orig -> origin/gh/mikaylagawarecki/356/orig 2025-12-04T08:54:02.2758749Z * [new branch] gh/mikaylagawarecki/357/base -> origin/gh/mikaylagawarecki/357/base 2025-12-04T08:54:02.2758839Z * [new branch] gh/mikaylagawarecki/357/head -> 
origin/gh/mikaylagawarecki/357/head 2025-12-04T08:54:02.2758928Z * [new branch] gh/mikaylagawarecki/357/orig -> origin/gh/mikaylagawarecki/357/orig 2025-12-04T08:54:02.2759023Z * [new branch] gh/mikaylagawarecki/359/base -> origin/gh/mikaylagawarecki/359/base 2025-12-04T08:54:02.2759111Z * [new branch] gh/mikaylagawarecki/359/head -> origin/gh/mikaylagawarecki/359/head 2025-12-04T08:54:02.2759201Z * [new branch] gh/mikaylagawarecki/359/orig -> origin/gh/mikaylagawarecki/359/orig 2025-12-04T08:54:02.2759321Z * [new branch] gh/mikaylagawarecki/360/base -> origin/gh/mikaylagawarecki/360/base 2025-12-04T08:54:02.2759411Z * [new branch] gh/mikaylagawarecki/360/head -> origin/gh/mikaylagawarecki/360/head 2025-12-04T08:54:02.2759501Z * [new branch] gh/mikaylagawarecki/360/orig -> origin/gh/mikaylagawarecki/360/orig 2025-12-04T08:54:02.2759591Z * [new branch] gh/mikaylagawarecki/361/base -> origin/gh/mikaylagawarecki/361/base 2025-12-04T08:54:02.2759680Z * [new branch] gh/mikaylagawarecki/361/head -> origin/gh/mikaylagawarecki/361/head 2025-12-04T08:54:02.2759771Z * [new branch] gh/mikaylagawarecki/361/orig -> origin/gh/mikaylagawarecki/361/orig 2025-12-04T08:54:02.2759861Z * [new branch] gh/mikaylagawarecki/362/base -> origin/gh/mikaylagawarecki/362/base 2025-12-04T08:54:02.2759950Z * [new branch] gh/mikaylagawarecki/362/head -> origin/gh/mikaylagawarecki/362/head 2025-12-04T08:54:02.2760041Z * [new branch] gh/mikaylagawarecki/362/orig -> origin/gh/mikaylagawarecki/362/orig 2025-12-04T08:54:02.2760132Z * [new branch] gh/mikaylagawarecki/363/base -> origin/gh/mikaylagawarecki/363/base 2025-12-04T08:54:02.2760222Z * [new branch] gh/mikaylagawarecki/363/head -> origin/gh/mikaylagawarecki/363/head 2025-12-04T08:54:02.2760313Z * [new branch] gh/mikaylagawarecki/363/orig -> origin/gh/mikaylagawarecki/363/orig 2025-12-04T08:54:02.2760401Z * [new branch] gh/mikaylagawarecki/364/base -> origin/gh/mikaylagawarecki/364/base 2025-12-04T08:54:02.2760492Z * [new branch] 
gh/mikaylagawarecki/364/head -> origin/gh/mikaylagawarecki/364/head 2025-12-04T08:54:02.2760582Z * [new branch] gh/mikaylagawarecki/364/orig -> origin/gh/mikaylagawarecki/364/orig 2025-12-04T08:54:02.2760671Z * [new branch] gh/mikaylagawarecki/365/base -> origin/gh/mikaylagawarecki/365/base 2025-12-04T08:54:02.2760760Z * [new branch] gh/mikaylagawarecki/365/head -> origin/gh/mikaylagawarecki/365/head 2025-12-04T08:54:02.2760850Z * [new branch] gh/mikaylagawarecki/365/orig -> origin/gh/mikaylagawarecki/365/orig 2025-12-04T08:54:02.2760939Z * [new branch] gh/mikaylagawarecki/366/base -> origin/gh/mikaylagawarecki/366/base 2025-12-04T08:54:02.2761027Z * [new branch] gh/mikaylagawarecki/366/head -> origin/gh/mikaylagawarecki/366/head 2025-12-04T08:54:02.2761117Z * [new branch] gh/mikaylagawarecki/366/orig -> origin/gh/mikaylagawarecki/366/orig 2025-12-04T08:54:02.2761236Z * [new branch] gh/mikaylagawarecki/367/base -> origin/gh/mikaylagawarecki/367/base 2025-12-04T08:54:02.2761325Z * [new branch] gh/mikaylagawarecki/367/head -> origin/gh/mikaylagawarecki/367/head 2025-12-04T08:54:02.2761415Z * [new branch] gh/mikaylagawarecki/367/orig -> origin/gh/mikaylagawarecki/367/orig 2025-12-04T08:54:02.2761505Z * [new branch] gh/mikaylagawarecki/368/base -> origin/gh/mikaylagawarecki/368/base 2025-12-04T08:54:02.2761593Z * [new branch] gh/mikaylagawarecki/368/head -> origin/gh/mikaylagawarecki/368/head 2025-12-04T08:54:02.2761683Z * [new branch] gh/mikaylagawarecki/368/orig -> origin/gh/mikaylagawarecki/368/orig 2025-12-04T08:54:02.2761772Z * [new branch] gh/mikaylagawarecki/369/base -> origin/gh/mikaylagawarecki/369/base 2025-12-04T08:54:02.2761862Z * [new branch] gh/mikaylagawarecki/369/head -> origin/gh/mikaylagawarecki/369/head 2025-12-04T08:54:02.2761952Z * [new branch] gh/mikaylagawarecki/369/orig -> origin/gh/mikaylagawarecki/369/orig 2025-12-04T08:54:02.2762041Z * [new branch] gh/mikaylagawarecki/370/base -> origin/gh/mikaylagawarecki/370/base 
2025-12-04T08:54:02.2762130Z * [new branch] gh/mikaylagawarecki/370/head -> origin/gh/mikaylagawarecki/370/head 2025-12-04T08:54:02.2762250Z * [new branch] gh/mikaylagawarecki/370/orig -> origin/gh/mikaylagawarecki/370/orig 2025-12-04T08:54:02.2762339Z * [new branch] gh/mikaylagawarecki/371/base -> origin/gh/mikaylagawarecki/371/base 2025-12-04T08:54:02.2762429Z * [new branch] gh/mikaylagawarecki/371/head -> origin/gh/mikaylagawarecki/371/head 2025-12-04T08:54:02.2762517Z * [new branch] gh/mikaylagawarecki/371/orig -> origin/gh/mikaylagawarecki/371/orig 2025-12-04T08:54:02.2762606Z * [new branch] gh/mikaylagawarecki/372/base -> origin/gh/mikaylagawarecki/372/base 2025-12-04T08:54:02.2762697Z * [new branch] gh/mikaylagawarecki/372/head -> origin/gh/mikaylagawarecki/372/head 2025-12-04T08:54:02.2762786Z * [new branch] gh/mikaylagawarecki/372/orig -> origin/gh/mikaylagawarecki/372/orig 2025-12-04T08:54:02.2762874Z * [new branch] gh/mikaylagawarecki/373/base -> origin/gh/mikaylagawarecki/373/base 2025-12-04T08:54:02.2762965Z * [new branch] gh/mikaylagawarecki/373/head -> origin/gh/mikaylagawarecki/373/head 2025-12-04T08:54:02.2763055Z * [new branch] gh/mikaylagawarecki/373/orig -> origin/gh/mikaylagawarecki/373/orig 2025-12-04T08:54:02.2763145Z * [new branch] gh/mikaylagawarecki/374/base -> origin/gh/mikaylagawarecki/374/base 2025-12-04T08:54:02.2763237Z * [new branch] gh/mikaylagawarecki/374/head -> origin/gh/mikaylagawarecki/374/head 2025-12-04T08:54:02.2763328Z * [new branch] gh/mikaylagawarecki/374/orig -> origin/gh/mikaylagawarecki/374/orig 2025-12-04T08:54:02.2763427Z * [new branch] gh/mikaylagawarecki/375/base -> origin/gh/mikaylagawarecki/375/base 2025-12-04T08:54:02.2763518Z * [new branch] gh/mikaylagawarecki/375/head -> origin/gh/mikaylagawarecki/375/head 2025-12-04T08:54:02.2763610Z * [new branch] gh/mikaylagawarecki/375/orig -> origin/gh/mikaylagawarecki/375/orig 2025-12-04T08:54:02.2763707Z * [new branch] gh/mikaylagawarecki/376/base -> 
origin/gh/mikaylagawarecki/376/base 2025-12-04T08:54:02.2763798Z * [new branch] gh/mikaylagawarecki/376/head -> origin/gh/mikaylagawarecki/376/head 2025-12-04T08:54:02.2763889Z * [new branch] gh/mikaylagawarecki/376/orig -> origin/gh/mikaylagawarecki/376/orig 2025-12-04T08:54:02.2763984Z * [new branch] gh/mikaylagawarecki/377/base -> origin/gh/mikaylagawarecki/377/base 2025-12-04T08:54:02.2764076Z * [new branch] gh/mikaylagawarecki/377/head -> origin/gh/mikaylagawarecki/377/head 2025-12-04T08:54:02.2764198Z * [new branch] gh/mikaylagawarecki/377/orig -> origin/gh/mikaylagawarecki/377/orig 2025-12-04T08:54:02.2764291Z * [new branch] gh/mikaylagawarecki/378/base -> origin/gh/mikaylagawarecki/378/base 2025-12-04T08:54:02.2764382Z * [new branch] gh/mikaylagawarecki/378/head -> origin/gh/mikaylagawarecki/378/head 2025-12-04T08:54:02.2764477Z * [new branch] gh/mikaylagawarecki/378/orig -> origin/gh/mikaylagawarecki/378/orig 2025-12-04T08:54:02.2764568Z * [new branch] gh/mikaylagawarecki/379/base -> origin/gh/mikaylagawarecki/379/base 2025-12-04T08:54:02.2764659Z * [new branch] gh/mikaylagawarecki/379/head -> origin/gh/mikaylagawarecki/379/head 2025-12-04T08:54:02.2764752Z * [new branch] gh/mikaylagawarecki/379/orig -> origin/gh/mikaylagawarecki/379/orig 2025-12-04T08:54:02.2764844Z * [new branch] gh/mikaylagawarecki/380/base -> origin/gh/mikaylagawarecki/380/base 2025-12-04T08:54:02.2764936Z * [new branch] gh/mikaylagawarecki/380/head -> origin/gh/mikaylagawarecki/380/head 2025-12-04T08:54:02.2765027Z * [new branch] gh/mikaylagawarecki/380/orig -> origin/gh/mikaylagawarecki/380/orig 2025-12-04T08:54:02.2765119Z * [new branch] gh/mikaylagawarecki/381/base -> origin/gh/mikaylagawarecki/381/base 2025-12-04T08:54:02.2765235Z * [new branch] gh/mikaylagawarecki/381/head -> origin/gh/mikaylagawarecki/381/head 2025-12-04T08:54:02.2765333Z * [new branch] gh/mikaylagawarecki/381/orig -> origin/gh/mikaylagawarecki/381/orig 2025-12-04T08:54:02.2765425Z * [new branch] 
gh/mikaylagawarecki/382/base -> origin/gh/mikaylagawarecki/382/base 2025-12-04T08:54:02.2765516Z * [new branch] gh/mikaylagawarecki/382/head -> origin/gh/mikaylagawarecki/382/head 2025-12-04T08:54:02.2765611Z * [new branch] gh/mikaylagawarecki/382/orig -> origin/gh/mikaylagawarecki/382/orig 2025-12-04T08:54:02.2765704Z * [new branch] gh/mikaylagawarecki/383/base -> origin/gh/mikaylagawarecki/383/base 2025-12-04T08:54:02.2765793Z * [new branch] gh/mikaylagawarecki/383/head -> origin/gh/mikaylagawarecki/383/head 2025-12-04T08:54:02.2765889Z * [new branch] gh/mikaylagawarecki/383/orig -> origin/gh/mikaylagawarecki/383/orig 2025-12-04T08:54:02.2766014Z * [new branch] gh/mikaylagawarecki/384/base -> origin/gh/mikaylagawarecki/384/base 2025-12-04T08:54:02.2766108Z * [new branch] gh/mikaylagawarecki/384/head -> origin/gh/mikaylagawarecki/384/head 2025-12-04T08:54:02.2766199Z * [new branch] gh/mikaylagawarecki/384/orig -> origin/gh/mikaylagawarecki/384/orig 2025-12-04T08:54:02.2766290Z * [new branch] gh/mikaylagawarecki/385/base -> origin/gh/mikaylagawarecki/385/base 2025-12-04T08:54:02.2766385Z * [new branch] gh/mikaylagawarecki/385/head -> origin/gh/mikaylagawarecki/385/head 2025-12-04T08:54:02.2766477Z * [new branch] gh/mikaylagawarecki/385/orig -> origin/gh/mikaylagawarecki/385/orig 2025-12-04T08:54:02.2766567Z * [new branch] gh/mikaylagawarecki/386/base -> origin/gh/mikaylagawarecki/386/base 2025-12-04T08:54:02.2766657Z * [new branch] gh/mikaylagawarecki/386/head -> origin/gh/mikaylagawarecki/386/head 2025-12-04T08:54:02.2766748Z * [new branch] gh/mikaylagawarecki/386/orig -> origin/gh/mikaylagawarecki/386/orig 2025-12-04T08:54:02.2766837Z * [new branch] gh/mikaylagawarecki/387/base -> origin/gh/mikaylagawarecki/387/base 2025-12-04T08:54:02.2766929Z * [new branch] gh/mikaylagawarecki/387/head -> origin/gh/mikaylagawarecki/387/head 2025-12-04T08:54:02.2767021Z * [new branch] gh/mikaylagawarecki/387/orig -> origin/gh/mikaylagawarecki/387/orig 
2025-12-04T08:54:02.2767112Z * [new branch] gh/mikaylagawarecki/388/base -> origin/gh/mikaylagawarecki/388/base 2025-12-04T08:54:02.2767250Z * [new branch] gh/mikaylagawarecki/388/head -> origin/gh/mikaylagawarecki/388/head 2025-12-04T08:54:02.2767340Z * [new branch] gh/mikaylagawarecki/388/orig -> origin/gh/mikaylagawarecki/388/orig 2025-12-04T08:54:02.2767430Z * [new branch] gh/mikaylagawarecki/389/base -> origin/gh/mikaylagawarecki/389/base 2025-12-04T08:54:02.2767523Z * [new branch] gh/mikaylagawarecki/389/head -> origin/gh/mikaylagawarecki/389/head 2025-12-04T08:54:02.2767615Z * [new branch] gh/mikaylagawarecki/389/orig -> origin/gh/mikaylagawarecki/389/orig 2025-12-04T08:54:02.2767706Z * [new branch] gh/mikaylagawarecki/390/base -> origin/gh/mikaylagawarecki/390/base 2025-12-04T08:54:02.2767795Z * [new branch] gh/mikaylagawarecki/390/head -> origin/gh/mikaylagawarecki/390/head 2025-12-04T08:54:02.2767886Z * [new branch] gh/mikaylagawarecki/390/orig -> origin/gh/mikaylagawarecki/390/orig 2025-12-04T08:54:02.2767979Z * [new branch] gh/mikaylagawarecki/391/base -> origin/gh/mikaylagawarecki/391/base 2025-12-04T08:54:02.2768069Z * [new branch] gh/mikaylagawarecki/391/head -> origin/gh/mikaylagawarecki/391/head 2025-12-04T08:54:02.2768159Z * [new branch] gh/mikaylagawarecki/391/orig -> origin/gh/mikaylagawarecki/391/orig 2025-12-04T08:54:02.2768293Z * [new branch] gh/mikaylagawarecki/392/base -> origin/gh/mikaylagawarecki/392/base 2025-12-04T08:54:02.2768383Z * [new branch] gh/mikaylagawarecki/392/head -> origin/gh/mikaylagawarecki/392/head 2025-12-04T08:54:02.2768473Z * [new branch] gh/mikaylagawarecki/392/orig -> origin/gh/mikaylagawarecki/392/orig 2025-12-04T08:54:02.2768546Z * [new branch] gh/mlazos/41/base -> origin/gh/mlazos/41/base 2025-12-04T08:54:02.2768614Z * [new branch] gh/mlazos/41/head -> origin/gh/mlazos/41/head 2025-12-04T08:54:02.2768683Z * [new branch] gh/mlazos/41/orig -> origin/gh/mlazos/41/orig 2025-12-04T08:54:02.2768756Z * [new branch] 
gh/mlazos/42/base -> origin/gh/mlazos/42/base 2025-12-04T08:54:02.2768822Z * [new branch] gh/mlazos/42/head -> origin/gh/mlazos/42/head 2025-12-04T08:54:02.2768890Z * [new branch] gh/mlazos/42/orig -> origin/gh/mlazos/42/orig 2025-12-04T08:54:02.2768955Z * [new branch] gh/mlazos/43/base -> origin/gh/mlazos/43/base 2025-12-04T08:54:02.2769020Z * [new branch] gh/mlazos/43/head -> origin/gh/mlazos/43/head 2025-12-04T08:54:02.2769090Z * [new branch] gh/mlazos/43/orig -> origin/gh/mlazos/43/orig 2025-12-04T08:54:02.2769155Z * [new branch] gh/mlazos/44/base -> origin/gh/mlazos/44/base 2025-12-04T08:54:02.2769220Z * [new branch] gh/mlazos/44/head -> origin/gh/mlazos/44/head 2025-12-04T08:54:02.2769287Z * [new branch] gh/mlazos/44/orig -> origin/gh/mlazos/44/orig 2025-12-04T08:54:02.2769352Z * [new branch] gh/mlazos/47/base -> origin/gh/mlazos/47/base 2025-12-04T08:54:02.2769417Z * [new branch] gh/mlazos/47/head -> origin/gh/mlazos/47/head 2025-12-04T08:54:02.2769487Z * [new branch] gh/mlazos/47/orig -> origin/gh/mlazos/47/orig 2025-12-04T08:54:02.2769553Z * [new branch] gh/mlazos/48/base -> origin/gh/mlazos/48/base 2025-12-04T08:54:02.2769619Z * [new branch] gh/mlazos/48/head -> origin/gh/mlazos/48/head 2025-12-04T08:54:02.2769685Z * [new branch] gh/mlazos/48/orig -> origin/gh/mlazos/48/orig 2025-12-04T08:54:02.2769750Z * [new branch] gh/mlazos/49/base -> origin/gh/mlazos/49/base 2025-12-04T08:54:02.2769815Z * [new branch] gh/mlazos/49/head -> origin/gh/mlazos/49/head 2025-12-04T08:54:02.2769913Z * [new branch] gh/mlazos/49/orig -> origin/gh/mlazos/49/orig 2025-12-04T08:54:02.2769979Z * [new branch] gh/mlazos/50/base -> origin/gh/mlazos/50/base 2025-12-04T08:54:02.2770044Z * [new branch] gh/mlazos/50/head -> origin/gh/mlazos/50/head 2025-12-04T08:54:02.2770110Z * [new branch] gh/mlazos/50/orig -> origin/gh/mlazos/50/orig 2025-12-04T08:54:02.2770189Z * [new branch] gh/mlazos/51/base -> origin/gh/mlazos/51/base 2025-12-04T08:54:02.2770265Z * [new branch] gh/mlazos/51/head 
-> origin/gh/mlazos/51/head 2025-12-04T08:54:02.2770362Z * [new branch] gh/mlazos/51/orig -> origin/gh/mlazos/51/orig 2025-12-04T08:54:02.2770468Z * [new branch] gh/mlazos/52/base -> origin/gh/mlazos/52/base 2025-12-04T08:54:02.2770552Z * [new branch] gh/mlazos/52/head -> origin/gh/mlazos/52/head 2025-12-04T08:54:02.2770644Z * [new branch] gh/mlazos/52/orig -> origin/gh/mlazos/52/orig 2025-12-04T08:54:02.2770721Z * [new branch] gh/mlazos/53/base -> origin/gh/mlazos/53/base 2025-12-04T08:54:02.2770807Z * [new branch] gh/mlazos/53/head -> origin/gh/mlazos/53/head 2025-12-04T08:54:02.2770892Z * [new branch] gh/mlazos/53/orig -> origin/gh/mlazos/53/orig 2025-12-04T08:54:02.2771014Z * [new branch] gh/mlazos/54/base -> origin/gh/mlazos/54/base 2025-12-04T08:54:02.2771109Z * [new branch] gh/mlazos/54/head -> origin/gh/mlazos/54/head 2025-12-04T08:54:02.2771192Z * [new branch] gh/mlazos/54/orig -> origin/gh/mlazos/54/orig 2025-12-04T08:54:02.2771268Z * [new branch] gh/mlazos/55/base -> origin/gh/mlazos/55/base 2025-12-04T08:54:02.2771362Z * [new branch] gh/mlazos/55/head -> origin/gh/mlazos/55/head 2025-12-04T08:54:02.2771433Z * [new branch] gh/mlazos/55/orig -> origin/gh/mlazos/55/orig 2025-12-04T08:54:02.2771519Z * [new branch] gh/mlazos/56/base -> origin/gh/mlazos/56/base 2025-12-04T08:54:02.2771614Z * [new branch] gh/mlazos/56/head -> origin/gh/mlazos/56/head 2025-12-04T08:54:02.2771696Z * [new branch] gh/mlazos/56/orig -> origin/gh/mlazos/56/orig 2025-12-04T08:54:02.2771774Z * [new branch] gh/mlazos/57/base -> origin/gh/mlazos/57/base 2025-12-04T08:54:02.2771864Z * [new branch] gh/mlazos/57/head -> origin/gh/mlazos/57/head 2025-12-04T08:54:02.2771936Z * [new branch] gh/mlazos/57/orig -> origin/gh/mlazos/57/orig 2025-12-04T08:54:02.2772022Z * [new branch] gh/mlazos/58/base -> origin/gh/mlazos/58/base 2025-12-04T08:54:02.2772121Z * [new branch] gh/mlazos/58/head -> origin/gh/mlazos/58/head 2025-12-04T08:54:02.2772199Z * [new branch] gh/mlazos/58/orig -> 
origin/gh/mlazos/58/orig 2025-12-04T08:54:02.2772277Z * [new branch] gh/mlazos/59/base -> origin/gh/mlazos/59/base 2025-12-04T08:54:02.2772366Z * [new branch] gh/mlazos/59/head -> origin/gh/mlazos/59/head 2025-12-04T08:54:02.2772436Z * [new branch] gh/mlazos/59/orig -> origin/gh/mlazos/59/orig 2025-12-04T08:54:02.2772541Z * [new branch] gh/mlazos/60/base -> origin/gh/mlazos/60/base 2025-12-04T08:54:02.2772624Z * [new branch] gh/mlazos/60/head -> origin/gh/mlazos/60/head 2025-12-04T08:54:02.2772703Z * [new branch] gh/mlazos/60/orig -> origin/gh/mlazos/60/orig 2025-12-04T08:54:02.2772792Z * [new branch] gh/mlazos/61/base -> origin/gh/mlazos/61/base 2025-12-04T08:54:02.2772868Z * [new branch] gh/mlazos/61/head -> origin/gh/mlazos/61/head 2025-12-04T08:54:02.2772938Z * [new branch] gh/mlazos/61/orig -> origin/gh/mlazos/61/orig 2025-12-04T08:54:02.2773082Z * [new branch] gh/mlazos/62/base -> origin/gh/mlazos/62/base 2025-12-04T08:54:02.2773158Z * [new branch] gh/mlazos/62/head -> origin/gh/mlazos/62/head 2025-12-04T08:54:02.2773235Z * [new branch] gh/mlazos/62/orig -> origin/gh/mlazos/62/orig 2025-12-04T08:54:02.2773325Z * [new branch] gh/mlazos/63/base -> origin/gh/mlazos/63/base 2025-12-04T08:54:02.2773402Z * [new branch] gh/mlazos/63/head -> origin/gh/mlazos/63/head 2025-12-04T08:54:02.2773472Z * [new branch] gh/mlazos/63/orig -> origin/gh/mlazos/63/orig 2025-12-04T08:54:02.2773586Z * [new branch] gh/mlazos/64/base -> origin/gh/mlazos/64/base 2025-12-04T08:54:02.2773665Z * [new branch] gh/mlazos/64/head -> origin/gh/mlazos/64/head 2025-12-04T08:54:02.2773740Z * [new branch] gh/mlazos/64/orig -> origin/gh/mlazos/64/orig 2025-12-04T08:54:02.2773829Z * [new branch] gh/mlazos/65/base -> origin/gh/mlazos/65/base 2025-12-04T08:54:02.2773905Z * [new branch] gh/mlazos/65/head -> origin/gh/mlazos/65/head 2025-12-04T08:54:02.2774005Z * [new branch] gh/mlazos/65/orig -> origin/gh/mlazos/65/orig 2025-12-04T08:54:02.2774114Z * [new branch] gh/mlazos/66/base -> 
origin/gh/mlazos/66/base 2025-12-04T08:54:02.2774190Z * [new branch] gh/mlazos/66/head -> origin/gh/mlazos/66/head 2025-12-04T08:54:02.2774279Z * [new branch] gh/mlazos/66/orig -> origin/gh/mlazos/66/orig 2025-12-04T08:54:02.2774355Z * [new branch] gh/mlazos/67/base -> origin/gh/mlazos/67/base 2025-12-04T08:54:02.2774436Z * [new branch] gh/mlazos/67/head -> origin/gh/mlazos/67/head 2025-12-04T08:54:02.2774532Z * [new branch] gh/mlazos/67/orig -> origin/gh/mlazos/67/orig 2025-12-04T08:54:02.2774614Z * [new branch] gh/mlazos/68/base -> origin/gh/mlazos/68/base 2025-12-04T08:54:02.2774692Z * [new branch] gh/mlazos/68/head -> origin/gh/mlazos/68/head 2025-12-04T08:54:02.2774780Z * [new branch] gh/mlazos/68/orig -> origin/gh/mlazos/68/orig 2025-12-04T08:54:02.2774865Z * [new branch] gh/mlazos/69/base -> origin/gh/mlazos/69/base 2025-12-04T08:54:02.2774941Z * [new branch] gh/mlazos/69/head -> origin/gh/mlazos/69/head 2025-12-04T08:54:02.2775037Z * [new branch] gh/mlazos/69/orig -> origin/gh/mlazos/69/orig 2025-12-04T08:54:02.2775119Z * [new branch] gh/mlazos/70/base -> origin/gh/mlazos/70/base 2025-12-04T08:54:02.2775197Z * [new branch] gh/mlazos/70/head -> origin/gh/mlazos/70/head 2025-12-04T08:54:02.2775286Z * [new branch] gh/mlazos/70/orig -> origin/gh/mlazos/70/orig 2025-12-04T08:54:02.2775368Z * [new branch] gh/mlazos/71/base -> origin/gh/mlazos/71/base 2025-12-04T08:54:02.2775453Z * [new branch] gh/mlazos/71/head -> origin/gh/mlazos/71/head 2025-12-04T08:54:02.2775540Z * [new branch] gh/mlazos/71/orig -> origin/gh/mlazos/71/orig 2025-12-04T08:54:02.2775623Z * [new branch] gh/mlazos/72/base -> origin/gh/mlazos/72/base 2025-12-04T08:54:02.2775711Z * [new branch] gh/mlazos/72/head -> origin/gh/mlazos/72/head 2025-12-04T08:54:02.2775794Z * [new branch] gh/mlazos/72/orig -> origin/gh/mlazos/72/orig 2025-12-04T08:54:02.2775870Z * [new branch] gh/mlazos/73/base -> origin/gh/mlazos/73/base 2025-12-04T08:54:02.2776001Z * [new branch] gh/mlazos/73/head -> 
origin/gh/mlazos/73/head 2025-12-04T08:54:02.2776095Z * [new branch] gh/mlazos/73/orig -> origin/gh/mlazos/73/orig 2025-12-04T08:54:02.2776238Z * [new branch] gh/mrmiywj/1/base -> origin/gh/mrmiywj/1/base 2025-12-04T08:54:02.2776334Z * [new branch] gh/mrmiywj/1/head -> origin/gh/mrmiywj/1/head 2025-12-04T08:54:02.2776420Z * [new branch] gh/muchulee8/73/base -> origin/gh/muchulee8/73/base 2025-12-04T08:54:02.2776504Z * [new branch] gh/muchulee8/73/head -> origin/gh/muchulee8/73/head 2025-12-04T08:54:02.2776594Z * [new branch] gh/muchulee8/73/orig -> origin/gh/muchulee8/73/orig 2025-12-04T08:54:02.2776704Z * [new branch] gh/naveenthangudu/1/base -> origin/gh/naveenthangudu/1/base 2025-12-04T08:54:02.2776801Z * [new branch] gh/naveenthangudu/1/head -> origin/gh/naveenthangudu/1/head 2025-12-04T08:54:02.2776907Z * [new branch] gh/naveenthangudu/1/orig -> origin/gh/naveenthangudu/1/orig 2025-12-04T08:54:02.2776998Z * [new branch] gh/naveenthangudu/2/base -> origin/gh/naveenthangudu/2/base 2025-12-04T08:54:02.2777101Z * [new branch] gh/naveenthangudu/2/head -> origin/gh/naveenthangudu/2/head 2025-12-04T08:54:02.2777185Z * [new branch] gh/naveenthangudu/2/orig -> origin/gh/naveenthangudu/2/orig 2025-12-04T08:54:02.2777287Z * [new branch] gh/naveenthangudu/3/base -> origin/gh/naveenthangudu/3/base 2025-12-04T08:54:02.2777440Z * [new branch] gh/naveenthangudu/3/head -> origin/gh/naveenthangudu/3/head 2025-12-04T08:54:02.2777529Z * [new branch] gh/naveenthangudu/3/orig -> origin/gh/naveenthangudu/3/orig 2025-12-04T08:54:02.2777618Z * [new branch] gh/naveenthangudu/4/base -> origin/gh/naveenthangudu/4/base 2025-12-04T08:54:02.2777721Z * [new branch] gh/naveenthangudu/4/head -> origin/gh/naveenthangudu/4/head 2025-12-04T08:54:02.2777805Z * [new branch] gh/naveenthangudu/4/orig -> origin/gh/naveenthangudu/4/orig 2025-12-04T08:54:02.2777904Z * [new branch] gh/naveenthangudu/5/base -> origin/gh/naveenthangudu/5/base 2025-12-04T08:54:02.2778020Z * [new branch] 
gh/naveenthangudu/5/head -> origin/gh/naveenthangudu/5/head 2025-12-04T08:54:02.2778109Z * [new branch] gh/naveenthangudu/5/orig -> origin/gh/naveenthangudu/5/orig 2025-12-04T08:54:02.2778198Z * [new branch] gh/naveenthangudu/6/base -> origin/gh/naveenthangudu/6/base 2025-12-04T08:54:02.2778303Z * [new branch] gh/naveenthangudu/6/head -> origin/gh/naveenthangudu/6/head 2025-12-04T08:54:02.2778388Z * [new branch] gh/naveenthangudu/6/orig -> origin/gh/naveenthangudu/6/orig 2025-12-04T08:54:02.2778518Z * [new branch] gh/naveenthangudu/7/base -> origin/gh/naveenthangudu/7/base 2025-12-04T08:54:02.2778607Z * [new branch] gh/naveenthangudu/7/head -> origin/gh/naveenthangudu/7/head 2025-12-04T08:54:02.2778697Z * [new branch] gh/naveenthangudu/7/orig -> origin/gh/naveenthangudu/7/orig 2025-12-04T08:54:02.2778800Z * [new branch] gh/naveenthangudu/8/base -> origin/gh/naveenthangudu/8/base 2025-12-04T08:54:02.2778888Z * [new branch] gh/naveenthangudu/8/head -> origin/gh/naveenthangudu/8/head 2025-12-04T08:54:02.2778972Z * [new branch] gh/naveenthangudu/8/orig -> origin/gh/naveenthangudu/8/orig 2025-12-04T08:54:02.2779100Z * [new branch] gh/naveenthangudu/9/base -> origin/gh/naveenthangudu/9/base 2025-12-04T08:54:02.2779190Z * [new branch] gh/naveenthangudu/9/head -> origin/gh/naveenthangudu/9/head 2025-12-04T08:54:02.2779280Z * [new branch] gh/naveenthangudu/9/orig -> origin/gh/naveenthangudu/9/orig 2025-12-04T08:54:02.2779376Z * [new branch] gh/nikitaved/1/base -> origin/gh/nikitaved/1/base 2025-12-04T08:54:02.2779458Z * [new branch] gh/nikitaved/1/head -> origin/gh/nikitaved/1/head 2025-12-04T08:54:02.2779542Z * [new branch] gh/nikitaved/1/orig -> origin/gh/nikitaved/1/orig 2025-12-04T08:54:02.2779698Z * [new branch] gh/nikitaved/10/base -> origin/gh/nikitaved/10/base 2025-12-04T08:54:02.2779781Z * [new branch] gh/nikitaved/10/head -> origin/gh/nikitaved/10/head 2025-12-04T08:54:02.2779875Z * [new branch] gh/nikitaved/10/orig -> origin/gh/nikitaved/10/orig 
2025-12-04T08:54:02.2779971Z * [new branch] gh/nikitaved/11/base -> origin/gh/nikitaved/11/base 2025-12-04T08:54:02.2780055Z * [new branch] gh/nikitaved/11/head -> origin/gh/nikitaved/11/head 2025-12-04T08:54:02.2780172Z * [new branch] gh/nikitaved/11/orig -> origin/gh/nikitaved/11/orig 2025-12-04T08:54:02.2780260Z * [new branch] gh/nikitaved/12/base -> origin/gh/nikitaved/12/base 2025-12-04T08:54:02.2780342Z * [new branch] gh/nikitaved/12/head -> origin/gh/nikitaved/12/head 2025-12-04T08:54:02.2780435Z * [new branch] gh/nikitaved/12/orig -> origin/gh/nikitaved/12/orig 2025-12-04T08:54:02.2780523Z * [new branch] gh/nikitaved/13/base -> origin/gh/nikitaved/13/base 2025-12-04T08:54:02.2780604Z * [new branch] gh/nikitaved/13/head -> origin/gh/nikitaved/13/head 2025-12-04T08:54:02.2780705Z * [new branch] gh/nikitaved/13/orig -> origin/gh/nikitaved/13/orig 2025-12-04T08:54:02.2780826Z * [new branch] gh/nikitaved/14/base -> origin/gh/nikitaved/14/base 2025-12-04T08:54:02.2780907Z * [new branch] gh/nikitaved/14/head -> origin/gh/nikitaved/14/head 2025-12-04T08:54:02.2781008Z * [new branch] gh/nikitaved/14/orig -> origin/gh/nikitaved/14/orig 2025-12-04T08:54:02.2781089Z * [new branch] gh/nikitaved/15/base -> origin/gh/nikitaved/15/base 2025-12-04T08:54:02.2781170Z * [new branch] gh/nikitaved/15/head -> origin/gh/nikitaved/15/head 2025-12-04T08:54:02.2781269Z * [new branch] gh/nikitaved/15/orig -> origin/gh/nikitaved/15/orig 2025-12-04T08:54:02.2781364Z * [new branch] gh/nikitaved/16/base -> origin/gh/nikitaved/16/base 2025-12-04T08:54:02.2781463Z * [new branch] gh/nikitaved/16/head -> origin/gh/nikitaved/16/head 2025-12-04T08:54:02.2781544Z * [new branch] gh/nikitaved/16/orig -> origin/gh/nikitaved/16/orig 2025-12-04T08:54:02.2781629Z * [new branch] gh/nikitaved/2/base -> origin/gh/nikitaved/2/base 2025-12-04T08:54:02.2781719Z * [new branch] gh/nikitaved/2/head -> origin/gh/nikitaved/2/head 2025-12-04T08:54:02.2783490Z * [new branch] gh/nikitaved/2/orig -> 
origin/gh/nikitaved/2/orig 2025-12-04T08:54:02.2783577Z * [new branch] gh/nikitaved/4/base -> origin/gh/nikitaved/4/base 2025-12-04T08:54:02.2783674Z * [new branch] gh/nikitaved/4/head -> origin/gh/nikitaved/4/head 2025-12-04T08:54:02.2783755Z * [new branch] gh/nikitaved/4/orig -> origin/gh/nikitaved/4/orig 2025-12-04T08:54:02.2783839Z * [new branch] gh/nikitaved/5/base -> origin/gh/nikitaved/5/base 2025-12-04T08:54:02.2783927Z * [new branch] gh/nikitaved/5/head -> origin/gh/nikitaved/5/head 2025-12-04T08:54:02.2784017Z * [new branch] gh/nikitaved/5/orig -> origin/gh/nikitaved/5/orig 2025-12-04T08:54:02.2784105Z * [new branch] gh/nikitaved/6/base -> origin/gh/nikitaved/6/base 2025-12-04T08:54:02.2784203Z * [new branch] gh/nikitaved/6/head -> origin/gh/nikitaved/6/head 2025-12-04T08:54:02.2784283Z * [new branch] gh/nikitaved/6/orig -> origin/gh/nikitaved/6/orig 2025-12-04T08:54:02.2784363Z * [new branch] gh/nikitaved/8/base -> origin/gh/nikitaved/8/base 2025-12-04T08:54:02.2784453Z * [new branch] gh/nikitaved/8/head -> origin/gh/nikitaved/8/head 2025-12-04T08:54:02.2784544Z * [new branch] gh/nikitaved/8/orig -> origin/gh/nikitaved/8/orig 2025-12-04T08:54:02.2784680Z * [new branch] gh/nikitaved/9/base -> origin/gh/nikitaved/9/base 2025-12-04T08:54:02.2784760Z * [new branch] gh/nikitaved/9/head -> origin/gh/nikitaved/9/head 2025-12-04T08:54:02.2784841Z * [new branch] gh/nikitaved/9/orig -> origin/gh/nikitaved/9/orig 2025-12-04T08:54:02.2784934Z * [new branch] gh/oulgen/10/base -> origin/gh/oulgen/10/base 2025-12-04T08:54:02.2785007Z * [new branch] gh/oulgen/10/head -> origin/gh/oulgen/10/head 2025-12-04T08:54:02.2785104Z * [new branch] gh/oulgen/10/orig -> origin/gh/oulgen/10/orig 2025-12-04T08:54:02.2785201Z * [new branch] gh/oulgen/11/base -> origin/gh/oulgen/11/base 2025-12-04T08:54:02.2785278Z * [new branch] gh/oulgen/11/head -> origin/gh/oulgen/11/head 2025-12-04T08:54:02.2785354Z * [new branch] gh/oulgen/11/orig -> origin/gh/oulgen/11/orig 
2025-12-04T08:54:02.2785446Z * [new branch] gh/oulgen/12/base -> origin/gh/oulgen/12/base 2025-12-04T08:54:02.2785516Z * [new branch] gh/oulgen/12/head -> origin/gh/oulgen/12/head 2025-12-04T08:54:02.2785606Z * [new branch] gh/oulgen/12/orig -> origin/gh/oulgen/12/orig 2025-12-04T08:54:02.2785728Z * [new branch] gh/oulgen/13/base -> origin/gh/oulgen/13/base 2025-12-04T08:54:02.2785804Z * [new branch] gh/oulgen/13/head -> origin/gh/oulgen/13/head 2025-12-04T08:54:02.2785880Z * [new branch] gh/oulgen/13/orig -> origin/gh/oulgen/13/orig 2025-12-04T08:54:02.2786013Z * [new branch] gh/oulgen/14/base -> origin/gh/oulgen/14/base 2025-12-04T08:54:02.2786086Z * [new branch] gh/oulgen/14/head -> origin/gh/oulgen/14/head 2025-12-04T08:54:02.2786190Z * [new branch] gh/oulgen/14/orig -> origin/gh/oulgen/14/orig 2025-12-04T08:54:02.2786268Z * [new branch] gh/oulgen/15/base -> origin/gh/oulgen/15/base 2025-12-04T08:54:02.2786346Z * [new branch] gh/oulgen/15/head -> origin/gh/oulgen/15/head 2025-12-04T08:54:02.2786439Z * [new branch] gh/oulgen/15/orig -> origin/gh/oulgen/15/orig 2025-12-04T08:54:02.2786517Z * [new branch] gh/oulgen/16/base -> origin/gh/oulgen/16/base 2025-12-04T08:54:02.2786588Z * [new branch] gh/oulgen/16/head -> origin/gh/oulgen/16/head 2025-12-04T08:54:02.2786692Z * [new branch] gh/oulgen/16/orig -> origin/gh/oulgen/16/orig 2025-12-04T08:54:02.2786768Z * [new branch] gh/oulgen/17/base -> origin/gh/oulgen/17/base 2025-12-04T08:54:02.2786848Z * [new branch] gh/oulgen/17/head -> origin/gh/oulgen/17/head 2025-12-04T08:54:02.2786941Z * [new branch] gh/oulgen/17/orig -> origin/gh/oulgen/17/orig 2025-12-04T08:54:02.2787020Z * [new branch] gh/oulgen/18/base -> origin/gh/oulgen/18/base 2025-12-04T08:54:02.2787091Z * [new branch] gh/oulgen/18/head -> origin/gh/oulgen/18/head 2025-12-04T08:54:02.2787196Z * [new branch] gh/oulgen/18/orig -> origin/gh/oulgen/18/orig 2025-12-04T08:54:02.2787271Z * [new branch] gh/oulgen/19/base -> origin/gh/oulgen/19/base 
2025-12-04T08:54:02.2787348Z * [new branch] gh/oulgen/19/head -> origin/gh/oulgen/19/head 2025-12-04T08:54:02.2787442Z * [new branch] gh/oulgen/19/orig -> origin/gh/oulgen/19/orig 2025-12-04T08:54:02.2787518Z * [new branch] gh/oulgen/20/base -> origin/gh/oulgen/20/base 2025-12-04T08:54:02.2787615Z * [new branch] gh/oulgen/20/head -> origin/gh/oulgen/20/head 2025-12-04T08:54:02.2787697Z * [new branch] gh/oulgen/20/orig -> origin/gh/oulgen/20/orig 2025-12-04T08:54:02.2787817Z * [new branch] gh/oulgen/21/base -> origin/gh/oulgen/21/base 2025-12-04T08:54:02.2787911Z * [new branch] gh/oulgen/21/head -> origin/gh/oulgen/21/head 2025-12-04T08:54:02.2787988Z * [new branch] gh/oulgen/21/orig -> origin/gh/oulgen/21/orig 2025-12-04T08:54:02.2788063Z * [new branch] gh/oulgen/22/base -> origin/gh/oulgen/22/base 2025-12-04T08:54:02.2788160Z * [new branch] gh/oulgen/22/head -> origin/gh/oulgen/22/head 2025-12-04T08:54:02.2788242Z * [new branch] gh/oulgen/22/orig -> origin/gh/oulgen/22/orig 2025-12-04T08:54:02.2788323Z * [new branch] gh/oulgen/23/base -> origin/gh/oulgen/23/base 2025-12-04T08:54:02.2788413Z * [new branch] gh/oulgen/23/head -> origin/gh/oulgen/23/head 2025-12-04T08:54:02.2788490Z * [new branch] gh/oulgen/23/orig -> origin/gh/oulgen/23/orig 2025-12-04T08:54:02.2788567Z * [new branch] gh/oulgen/24/base -> origin/gh/oulgen/24/base 2025-12-04T08:54:02.2788661Z * [new branch] gh/oulgen/24/head -> origin/gh/oulgen/24/head 2025-12-04T08:54:02.2788743Z * [new branch] gh/oulgen/24/orig -> origin/gh/oulgen/24/orig 2025-12-04T08:54:02.2788825Z * [new branch] gh/oulgen/25/base -> origin/gh/oulgen/25/base 2025-12-04T08:54:02.2788956Z * [new branch] gh/oulgen/25/head -> origin/gh/oulgen/25/head 2025-12-04T08:54:02.2789037Z * [new branch] gh/oulgen/25/orig -> origin/gh/oulgen/25/orig 2025-12-04T08:54:02.2789122Z * [new branch] gh/oulgen/26/base -> origin/gh/oulgen/26/base 2025-12-04T08:54:02.2789210Z * [new branch] gh/oulgen/26/head -> origin/gh/oulgen/26/head 
2025-12-04T08:54:02.2789300Z * [new branch] gh/oulgen/26/orig -> origin/gh/oulgen/26/orig 2025-12-04T08:54:02.2789393Z * [new branch] gh/oulgen/4/base -> origin/gh/oulgen/4/base 2025-12-04T08:54:02.2789471Z * [new branch] gh/oulgen/4/head -> origin/gh/oulgen/4/head 2025-12-04T08:54:02.2789549Z * [new branch] gh/oulgen/4/orig -> origin/gh/oulgen/4/orig 2025-12-04T08:54:02.2789632Z * [new branch] gh/oulgen/7/base -> origin/gh/oulgen/7/base 2025-12-04T08:54:02.2789729Z * [new branch] gh/oulgen/7/head -> origin/gh/oulgen/7/head 2025-12-04T08:54:02.2789811Z * [new branch] gh/oulgen/7/orig -> origin/gh/oulgen/7/orig 2025-12-04T08:54:02.2789900Z * [new branch] gh/oulgen/8/base -> origin/gh/oulgen/8/base 2025-12-04T08:54:02.2789975Z * [new branch] gh/oulgen/8/head -> origin/gh/oulgen/8/head 2025-12-04T08:54:02.2790052Z * [new branch] gh/oulgen/8/orig -> origin/gh/oulgen/8/orig 2025-12-04T08:54:02.2790135Z * [new branch] gh/oulgen/9/base -> origin/gh/oulgen/9/base 2025-12-04T08:54:02.2790227Z * [new branch] gh/oulgen/9/head -> origin/gh/oulgen/9/head 2025-12-04T08:54:02.2790309Z * [new branch] gh/oulgen/9/orig -> origin/gh/oulgen/9/orig 2025-12-04T08:54:02.2790436Z * [new branch] gh/patvig/mtia-serialization -> origin/gh/patvig/mtia-serialization 2025-12-04T08:54:02.2790516Z * [new branch] gh/pearu/108/base -> origin/gh/pearu/108/base 2025-12-04T08:54:02.2790607Z * [new branch] gh/pearu/108/head -> origin/gh/pearu/108/head 2025-12-04T08:54:02.2790683Z * [new branch] gh/pearu/108/orig -> origin/gh/pearu/108/orig 2025-12-04T08:54:02.2790771Z * [new branch] gh/pearu/109/base -> origin/gh/pearu/109/base 2025-12-04T08:54:02.2791021Z * [new branch] gh/pearu/109/head -> origin/gh/pearu/109/head 2025-12-04T08:54:02.2791099Z * [new branch] gh/pearu/109/orig -> origin/gh/pearu/109/orig 2025-12-04T08:54:02.2791207Z * [new branch] gh/pearu/110/base -> origin/gh/pearu/110/base 2025-12-04T08:54:02.2791302Z * [new branch] gh/pearu/110/head -> origin/gh/pearu/110/head 
2025-12-04T08:54:02.2791374Z * [new branch] gh/pearu/110/orig -> origin/gh/pearu/110/orig 2025-12-04T08:54:02.2791463Z * [new branch] gh/pearu/111/base -> origin/gh/pearu/111/base 2025-12-04T08:54:02.2791557Z * [new branch] gh/pearu/111/head -> origin/gh/pearu/111/head 2025-12-04T08:54:02.2791634Z * [new branch] gh/pearu/111/orig -> origin/gh/pearu/111/orig 2025-12-04T08:54:02.2791709Z * [new branch] gh/pearu/112/base -> origin/gh/pearu/112/base 2025-12-04T08:54:02.2791804Z * [new branch] gh/pearu/112/head -> origin/gh/pearu/112/head 2025-12-04T08:54:02.2791876Z * [new branch] gh/pearu/112/orig -> origin/gh/pearu/112/orig 2025-12-04T08:54:02.2791966Z * [new branch] gh/pearu/115/base -> origin/gh/pearu/115/base 2025-12-04T08:54:02.2792059Z * [new branch] gh/pearu/115/head -> origin/gh/pearu/115/head 2025-12-04T08:54:02.2792135Z * [new branch] gh/pearu/115/orig -> origin/gh/pearu/115/orig 2025-12-04T08:54:02.2792261Z * [new branch] gh/pearu/116/base -> origin/gh/pearu/116/base 2025-12-04T08:54:02.2792340Z * [new branch] gh/pearu/116/head -> origin/gh/pearu/116/head 2025-12-04T08:54:02.2792412Z * [new branch] gh/pearu/116/orig -> origin/gh/pearu/116/orig 2025-12-04T08:54:02.2792515Z * [new branch] gh/pearu/117/base -> origin/gh/pearu/117/base 2025-12-04T08:54:02.2792592Z * [new branch] gh/pearu/117/head -> origin/gh/pearu/117/head 2025-12-04T08:54:02.2792675Z * [new branch] gh/pearu/117/orig -> origin/gh/pearu/117/orig 2025-12-04T08:54:02.2792766Z * [new branch] gh/pearu/118/base -> origin/gh/pearu/118/base 2025-12-04T08:54:02.2792844Z * [new branch] gh/pearu/118/head -> origin/gh/pearu/118/head 2025-12-04T08:54:02.2792916Z * [new branch] gh/pearu/118/orig -> origin/gh/pearu/118/orig 2025-12-04T08:54:02.2793026Z * [new branch] gh/pearu/119/base -> origin/gh/pearu/119/base 2025-12-04T08:54:02.2793107Z * [new branch] gh/pearu/119/head -> origin/gh/pearu/119/head 2025-12-04T08:54:02.2793184Z * [new branch] gh/pearu/119/orig -> origin/gh/pearu/119/orig 
2025-12-04T08:54:02.2793275Z * [new branch] gh/pearu/139/base -> origin/gh/pearu/139/base 2025-12-04T08:54:02.2793351Z * [new branch] gh/pearu/139/head -> origin/gh/pearu/139/head 2025-12-04T08:54:02.2793423Z * [new branch] gh/pearu/139/orig -> origin/gh/pearu/139/orig 2025-12-04T08:54:02.2793529Z * [new branch] gh/pearu/140/base -> origin/gh/pearu/140/base 2025-12-04T08:54:02.2793614Z * [new branch] gh/pearu/140/head -> origin/gh/pearu/140/head 2025-12-04T08:54:02.2793703Z * [new branch] gh/pearu/140/orig -> origin/gh/pearu/140/orig 2025-12-04T08:54:02.2793779Z * [new branch] gh/pearu/142/base -> origin/gh/pearu/142/base 2025-12-04T08:54:02.2793858Z * [new branch] gh/pearu/142/head -> origin/gh/pearu/142/head 2025-12-04T08:54:02.2793953Z * [new branch] gh/pearu/142/orig -> origin/gh/pearu/142/orig 2025-12-04T08:54:02.2794039Z * [new branch] gh/pearu/143/base -> origin/gh/pearu/143/base 2025-12-04T08:54:02.2794115Z * [new branch] gh/pearu/143/head -> origin/gh/pearu/143/head 2025-12-04T08:54:02.2794203Z * [new branch] gh/pearu/143/orig -> origin/gh/pearu/143/orig 2025-12-04T08:54:02.2794308Z * [new branch] gh/pearu/147/base -> origin/gh/pearu/147/base 2025-12-04T08:54:02.2794385Z * [new branch] gh/pearu/147/head -> origin/gh/pearu/147/head 2025-12-04T08:54:02.2794487Z * [new branch] gh/pearu/147/orig -> origin/gh/pearu/147/orig 2025-12-04T08:54:02.2794568Z * [new branch] gh/pearu/149/base -> origin/gh/pearu/149/base 2025-12-04T08:54:02.2794646Z * [new branch] gh/pearu/149/head -> origin/gh/pearu/149/head 2025-12-04T08:54:02.2794736Z * [new branch] gh/pearu/149/orig -> origin/gh/pearu/149/orig 2025-12-04T08:54:02.2794812Z * [new branch] gh/pearu/150/base -> origin/gh/pearu/150/base 2025-12-04T08:54:02.2794888Z * [new branch] gh/pearu/150/head -> origin/gh/pearu/150/head 2025-12-04T08:54:02.2794987Z * [new branch] gh/pearu/150/orig -> origin/gh/pearu/150/orig 2025-12-04T08:54:02.2795072Z * [new branch] gh/pearu/151/base -> origin/gh/pearu/151/base 
2025-12-04T08:54:02.2795163Z * [new branch] gh/pearu/151/head -> origin/gh/pearu/151/head 2025-12-04T08:54:02.2795239Z * [new branch] gh/pearu/151/orig -> origin/gh/pearu/151/orig 2025-12-04T08:54:02.2795316Z * [new branch] gh/pearu/152/base -> origin/gh/pearu/152/base 2025-12-04T08:54:02.2795431Z * [new branch] gh/pearu/152/head -> origin/gh/pearu/152/head 2025-12-04T08:54:02.2795519Z * [new branch] gh/pearu/152/orig -> origin/gh/pearu/152/orig 2025-12-04T08:54:02.2795601Z * [new branch] gh/pearu/153/base -> origin/gh/pearu/153/base 2025-12-04T08:54:02.2795690Z * [new branch] gh/pearu/153/head -> origin/gh/pearu/153/head 2025-12-04T08:54:02.2795766Z * [new branch] gh/pearu/153/orig -> origin/gh/pearu/153/orig 2025-12-04T08:54:02.2795848Z * [new branch] gh/pearu/154/base -> origin/gh/pearu/154/base 2025-12-04T08:54:02.2795975Z * [new branch] gh/pearu/154/head -> origin/gh/pearu/154/head 2025-12-04T08:54:02.2796066Z * [new branch] gh/pearu/154/orig -> origin/gh/pearu/154/orig 2025-12-04T08:54:02.2796148Z * [new branch] gh/pearu/155/base -> origin/gh/pearu/155/base 2025-12-04T08:54:02.2796240Z * [new branch] gh/pearu/155/head -> origin/gh/pearu/155/head 2025-12-04T08:54:02.2796316Z * [new branch] gh/pearu/155/orig -> origin/gh/pearu/155/orig 2025-12-04T08:54:02.2796397Z * [new branch] gh/pearu/156/base -> origin/gh/pearu/156/base 2025-12-04T08:54:02.2796480Z * [new branch] gh/pearu/156/head -> origin/gh/pearu/156/head 2025-12-04T08:54:02.2796568Z * [new branch] gh/pearu/156/orig -> origin/gh/pearu/156/orig 2025-12-04T08:54:02.2796663Z * [new branch] gh/pearu/56/base -> origin/gh/pearu/56/base 2025-12-04T08:54:02.2796743Z * [new branch] gh/pearu/56/head -> origin/gh/pearu/56/head 2025-12-04T08:54:02.2796824Z * [new branch] gh/pearu/56/orig -> origin/gh/pearu/56/orig 2025-12-04T08:54:02.2796914Z * [new branch] gh/pearu/97/base -> origin/gh/pearu/97/base 2025-12-04T08:54:02.2796986Z * [new branch] gh/pearu/97/head -> origin/gh/pearu/97/head 2025-12-04T08:54:02.2797073Z 
* [new branch] gh/pearu/97/orig -> origin/gh/pearu/97/orig 2025-12-04T08:54:02.2797177Z * [new branch] gh/pianpwk/21/base -> origin/gh/pianpwk/21/base 2025-12-04T08:54:02.2797259Z * [new branch] gh/pianpwk/21/head -> origin/gh/pianpwk/21/head 2025-12-04T08:54:02.2797348Z * [new branch] gh/pianpwk/28/base -> origin/gh/pianpwk/28/base 2025-12-04T08:54:02.2797440Z * [new branch] gh/pianpwk/28/head -> origin/gh/pianpwk/28/head 2025-12-04T08:54:02.2797555Z * [new branch] gh/pianpwk/28/orig -> origin/gh/pianpwk/28/orig 2025-12-04T08:54:02.2797643Z * [new branch] gh/pianpwk/29/base -> origin/gh/pianpwk/29/base 2025-12-04T08:54:02.2797746Z * [new branch] gh/pianpwk/29/head -> origin/gh/pianpwk/29/head 2025-12-04T08:54:02.2797826Z * [new branch] gh/pianpwk/29/orig -> origin/gh/pianpwk/29/orig 2025-12-04T08:54:02.2797905Z * [new branch] gh/pianpwk/30/base -> origin/gh/pianpwk/30/base 2025-12-04T08:54:02.2798002Z * [new branch] gh/pianpwk/30/head -> origin/gh/pianpwk/30/head 2025-12-04T08:54:02.2798077Z * [new branch] gh/pianpwk/30/orig -> origin/gh/pianpwk/30/orig 2025-12-04T08:54:02.2798184Z * [new branch] gh/pianpwk/31/base -> origin/gh/pianpwk/31/base 2025-12-04T08:54:02.2798269Z * [new branch] gh/pianpwk/31/head -> origin/gh/pianpwk/31/head 2025-12-04T08:54:02.2798349Z * [new branch] gh/pianpwk/31/orig -> origin/gh/pianpwk/31/orig 2025-12-04T08:54:02.2798439Z * [new branch] gh/pianpwk/32/base -> origin/gh/pianpwk/32/base 2025-12-04T08:54:02.2798519Z * [new branch] gh/pianpwk/32/head -> origin/gh/pianpwk/32/head 2025-12-04T08:54:02.2798637Z * [new branch] gh/pianpwk/32/orig -> origin/gh/pianpwk/32/orig 2025-12-04T08:54:02.2802399Z * [new branch] gh/pianpwk/33/base -> origin/gh/pianpwk/33/base 2025-12-04T08:54:02.2802485Z * [new branch] gh/pianpwk/33/head -> origin/gh/pianpwk/33/head 2025-12-04T08:54:02.2802558Z * [new branch] gh/pianpwk/33/orig -> origin/gh/pianpwk/33/orig 2025-12-04T08:54:02.2802627Z * [new branch] gh/pianpwk/34/base -> origin/gh/pianpwk/34/base 
2025-12-04T08:54:02.2802695Z * [new branch] gh/pianpwk/34/head -> origin/gh/pianpwk/34/head 2025-12-04T08:54:02.2802769Z * [new branch] gh/pianpwk/34/orig -> origin/gh/pianpwk/34/orig 2025-12-04T08:54:02.2802836Z * [new branch] gh/pianpwk/35/base -> origin/gh/pianpwk/35/base 2025-12-04T08:54:02.2802906Z * [new branch] gh/pianpwk/35/head -> origin/gh/pianpwk/35/head 2025-12-04T08:54:02.2802979Z * [new branch] gh/pianpwk/35/orig -> origin/gh/pianpwk/35/orig 2025-12-04T08:54:02.2803045Z * [new branch] gh/rec/141/base -> origin/gh/rec/141/base 2025-12-04T08:54:02.2803111Z * [new branch] gh/rec/141/head -> origin/gh/rec/141/head 2025-12-04T08:54:02.2803174Z * [new branch] gh/rec/153/base -> origin/gh/rec/153/base 2025-12-04T08:54:02.2803237Z * [new branch] gh/rec/153/head -> origin/gh/rec/153/head 2025-12-04T08:54:02.2803300Z * [new branch] gh/rec/153/orig -> origin/gh/rec/153/orig 2025-12-04T08:54:02.2803366Z * [new branch] gh/rec/154/base -> origin/gh/rec/154/base 2025-12-04T08:54:02.2803428Z * [new branch] gh/rec/154/head -> origin/gh/rec/154/head 2025-12-04T08:54:02.2803492Z * [new branch] gh/rec/154/orig -> origin/gh/rec/154/orig 2025-12-04T08:54:02.2803555Z * [new branch] gh/rec/164/base -> origin/gh/rec/164/base 2025-12-04T08:54:02.2803618Z * [new branch] gh/rec/164/head -> origin/gh/rec/164/head 2025-12-04T08:54:02.2803681Z * [new branch] gh/rec/164/orig -> origin/gh/rec/164/orig 2025-12-04T08:54:02.2803742Z * [new branch] gh/rec/166/base -> origin/gh/rec/166/base 2025-12-04T08:54:02.2803804Z * [new branch] gh/rec/166/head -> origin/gh/rec/166/head 2025-12-04T08:54:02.2803869Z * [new branch] gh/rec/166/orig -> origin/gh/rec/166/orig 2025-12-04T08:54:02.2803985Z * [new branch] gh/rec/167/base -> origin/gh/rec/167/base 2025-12-04T08:54:02.2804046Z * [new branch] gh/rec/167/head -> origin/gh/rec/167/head 2025-12-04T08:54:02.2804110Z * [new branch] gh/rec/167/orig -> origin/gh/rec/167/orig 2025-12-04T08:54:02.2804171Z * [new branch] gh/rec/168/base -> 
origin/gh/rec/168/base 2025-12-04T08:54:02.2804236Z * [new branch] gh/rec/168/head -> origin/gh/rec/168/head 2025-12-04T08:54:02.2804302Z * [new branch] gh/rec/168/orig -> origin/gh/rec/168/orig 2025-12-04T08:54:02.2804363Z * [new branch] gh/rec/169/base -> origin/gh/rec/169/base 2025-12-04T08:54:02.2804425Z * [new branch] gh/rec/169/head -> origin/gh/rec/169/head 2025-12-04T08:54:02.2804491Z * [new branch] gh/rec/169/orig -> origin/gh/rec/169/orig 2025-12-04T08:54:02.2804554Z * [new branch] gh/rec/170/base -> origin/gh/rec/170/base 2025-12-04T08:54:02.2804619Z * [new branch] gh/rec/170/head -> origin/gh/rec/170/head 2025-12-04T08:54:02.2804681Z * [new branch] gh/rec/170/orig -> origin/gh/rec/170/orig 2025-12-04T08:54:02.2804743Z * [new branch] gh/rec/171/base -> origin/gh/rec/171/base 2025-12-04T08:54:02.2804833Z * [new branch] gh/rec/171/head -> origin/gh/rec/171/head 2025-12-04T08:54:02.2804896Z * [new branch] gh/rec/171/orig -> origin/gh/rec/171/orig 2025-12-04T08:54:02.2804958Z * [new branch] gh/rec/172/base -> origin/gh/rec/172/base 2025-12-04T08:54:02.2805021Z * [new branch] gh/rec/172/head -> origin/gh/rec/172/head 2025-12-04T08:54:02.2805083Z * [new branch] gh/rec/172/orig -> origin/gh/rec/172/orig 2025-12-04T08:54:02.2805145Z * [new branch] gh/rec/173/base -> origin/gh/rec/173/base 2025-12-04T08:54:02.2805209Z * [new branch] gh/rec/173/head -> origin/gh/rec/173/head 2025-12-04T08:54:02.2805271Z * [new branch] gh/rec/173/orig -> origin/gh/rec/173/orig 2025-12-04T08:54:02.2805332Z * [new branch] gh/rec/174/base -> origin/gh/rec/174/base 2025-12-04T08:54:02.2805395Z * [new branch] gh/rec/174/head -> origin/gh/rec/174/head 2025-12-04T08:54:02.2805458Z * [new branch] gh/rec/174/orig -> origin/gh/rec/174/orig 2025-12-04T08:54:02.2805520Z * [new branch] gh/rec/175/base -> origin/gh/rec/175/base 2025-12-04T08:54:02.2805583Z * [new branch] gh/rec/175/head -> origin/gh/rec/175/head 2025-12-04T08:54:02.2805646Z * [new branch] gh/rec/175/orig -> 
origin/gh/rec/175/orig 2025-12-04T08:54:02.2805708Z * [new branch] gh/rec/176/base -> origin/gh/rec/176/base 2025-12-04T08:54:02.2805774Z * [new branch] gh/rec/176/head -> origin/gh/rec/176/head 2025-12-04T08:54:02.2805837Z * [new branch] gh/rec/176/orig -> origin/gh/rec/176/orig 2025-12-04T08:54:02.2805898Z * [new branch] gh/rec/177/base -> origin/gh/rec/177/base 2025-12-04T08:54:02.2806004Z * [new branch] gh/rec/177/head -> origin/gh/rec/177/head 2025-12-04T08:54:02.2806071Z * [new branch] gh/rec/177/orig -> origin/gh/rec/177/orig 2025-12-04T08:54:02.2806162Z * [new branch] gh/robert-hardwick/3/base -> origin/gh/robert-hardwick/3/base 2025-12-04T08:54:02.2806248Z * [new branch] gh/robert-hardwick/3/head -> origin/gh/robert-hardwick/3/head 2025-12-04T08:54:02.2806329Z * [new branch] gh/robert-hardwick/3/orig -> origin/gh/robert-hardwick/3/orig 2025-12-04T08:54:02.2806412Z * [new branch] gh/robert-hardwick/4/base -> origin/gh/robert-hardwick/4/base 2025-12-04T08:54:02.2806542Z * [new branch] gh/robert-hardwick/4/head -> origin/gh/robert-hardwick/4/head 2025-12-04T08:54:02.2806621Z * [new branch] gh/robert-hardwick/4/orig -> origin/gh/robert-hardwick/4/orig 2025-12-04T08:54:02.2806702Z * [new branch] gh/robert-hardwick/5/base -> origin/gh/robert-hardwick/5/base 2025-12-04T08:54:02.2806784Z * [new branch] gh/robert-hardwick/5/head -> origin/gh/robert-hardwick/5/head 2025-12-04T08:54:02.2806864Z * [new branch] gh/robert-hardwick/5/orig -> origin/gh/robert-hardwick/5/orig 2025-12-04T08:54:02.2806946Z * [new branch] gh/robert-hardwick/6/base -> origin/gh/robert-hardwick/6/base 2025-12-04T08:54:02.2807026Z * [new branch] gh/robert-hardwick/6/head -> origin/gh/robert-hardwick/6/head 2025-12-04T08:54:02.2807105Z * [new branch] gh/robert-hardwick/6/orig -> origin/gh/robert-hardwick/6/orig 2025-12-04T08:54:02.2807187Z * [new branch] gh/robert-hardwick/7/base -> origin/gh/robert-hardwick/7/base 2025-12-04T08:54:02.2807269Z * [new branch] gh/robert-hardwick/7/head -> 
origin/gh/robert-hardwick/7/head 2025-12-04T08:54:02.2807349Z * [new branch] gh/robert-hardwick/7/orig -> origin/gh/robert-hardwick/7/orig 2025-12-04T08:54:02.2807429Z * [new branch] gh/robert-hardwick/8/base -> origin/gh/robert-hardwick/8/base 2025-12-04T08:54:02.2807566Z * [new branch] gh/robert-hardwick/8/head -> origin/gh/robert-hardwick/8/head 2025-12-04T08:54:02.2807647Z * [new branch] gh/robert-hardwick/8/orig -> origin/gh/robert-hardwick/8/orig 2025-12-04T08:54:02.2807728Z * [new branch] gh/robert-hardwick/9/base -> origin/gh/robert-hardwick/9/base 2025-12-04T08:54:02.2807808Z * [new branch] gh/robert-hardwick/9/head -> origin/gh/robert-hardwick/9/head 2025-12-04T08:54:02.2807887Z * [new branch] gh/robert-hardwick/9/orig -> origin/gh/robert-hardwick/9/orig 2025-12-04T08:54:02.2807962Z * [new branch] gh/rtimpe/1/base -> origin/gh/rtimpe/1/base 2025-12-04T08:54:02.2808030Z * [new branch] gh/rtimpe/1/head -> origin/gh/rtimpe/1/head 2025-12-04T08:54:02.2808097Z * [new branch] gh/rtimpe/2/base -> origin/gh/rtimpe/2/base 2025-12-04T08:54:02.2808162Z * [new branch] gh/rtimpe/2/head -> origin/gh/rtimpe/2/head 2025-12-04T08:54:02.2808231Z * [new branch] gh/rtimpe/22/base -> origin/gh/rtimpe/22/base 2025-12-04T08:54:02.2808300Z * [new branch] gh/rtimpe/22/head -> origin/gh/rtimpe/22/head 2025-12-04T08:54:02.2808366Z * [new branch] gh/rtimpe/22/orig -> origin/gh/rtimpe/22/orig 2025-12-04T08:54:02.2808435Z * [new branch] gh/rtimpe/23/base -> origin/gh/rtimpe/23/base 2025-12-04T08:54:02.2808501Z * [new branch] gh/rtimpe/23/head -> origin/gh/rtimpe/23/head 2025-12-04T08:54:02.2808570Z * [new branch] gh/rtimpe/23/orig -> origin/gh/rtimpe/23/orig 2025-12-04T08:54:02.2808635Z * [new branch] gh/rtimpe/24/base -> origin/gh/rtimpe/24/base 2025-12-04T08:54:02.2808702Z * [new branch] gh/rtimpe/24/head -> origin/gh/rtimpe/24/head 2025-12-04T08:54:02.2808771Z * [new branch] gh/rtimpe/24/orig -> origin/gh/rtimpe/24/orig 2025-12-04T08:54:02.2808839Z * [new branch] gh/rtimpe/25/base 
-> origin/gh/rtimpe/25/base 2025-12-04T08:54:02.2808904Z * [new branch] gh/rtimpe/25/head -> origin/gh/rtimpe/25/head 2025-12-04T08:54:02.2808971Z * [new branch] gh/rtimpe/25/orig -> origin/gh/rtimpe/25/orig 2025-12-04T08:54:02.2809036Z * [new branch] gh/rtimpe/26/base -> origin/gh/rtimpe/26/base 2025-12-04T08:54:02.2809102Z * [new branch] gh/rtimpe/26/head -> origin/gh/rtimpe/26/head 2025-12-04T08:54:02.2809195Z * [new branch] gh/rtimpe/26/orig -> origin/gh/rtimpe/26/orig 2025-12-04T08:54:02.2809261Z * [new branch] gh/rtimpe/27/base -> origin/gh/rtimpe/27/base 2025-12-04T08:54:02.2809328Z * [new branch] gh/rtimpe/27/head -> origin/gh/rtimpe/27/head 2025-12-04T08:54:02.2809393Z * [new branch] gh/rtimpe/27/orig -> origin/gh/rtimpe/27/orig 2025-12-04T08:54:02.2809461Z * [new branch] gh/rtimpe/28/base -> origin/gh/rtimpe/28/base 2025-12-04T08:54:02.2809529Z * [new branch] gh/rtimpe/28/head -> origin/gh/rtimpe/28/head 2025-12-04T08:54:02.2809594Z * [new branch] gh/rtimpe/28/orig -> origin/gh/rtimpe/28/orig 2025-12-04T08:54:02.2809660Z * [new branch] gh/rtimpe/29/base -> origin/gh/rtimpe/29/base 2025-12-04T08:54:02.2809727Z * [new branch] gh/rtimpe/29/head -> origin/gh/rtimpe/29/head 2025-12-04T08:54:02.2809792Z * [new branch] gh/rtimpe/29/orig -> origin/gh/rtimpe/29/orig 2025-12-04T08:54:02.2809861Z * [new branch] gh/rtimpe/3/base -> origin/gh/rtimpe/3/base 2025-12-04T08:54:02.2809929Z * [new branch] gh/rtimpe/3/head -> origin/gh/rtimpe/3/head 2025-12-04T08:54:02.2809995Z * [new branch] gh/rtimpe/30/base -> origin/gh/rtimpe/30/base 2025-12-04T08:54:02.2810095Z * [new branch] gh/rtimpe/30/head -> origin/gh/rtimpe/30/head 2025-12-04T08:54:02.2810162Z * [new branch] gh/rtimpe/30/orig -> origin/gh/rtimpe/30/orig 2025-12-04T08:54:02.2810228Z * [new branch] gh/rtimpe/31/base -> origin/gh/rtimpe/31/base 2025-12-04T08:54:02.2810293Z * [new branch] gh/rtimpe/31/head -> origin/gh/rtimpe/31/head 2025-12-04T08:54:02.2810360Z * [new branch] gh/rtimpe/31/orig -> 
origin/gh/rtimpe/31/orig 2025-12-04T08:54:02.2810425Z * [new branch] gh/rtimpe/32/base -> origin/gh/rtimpe/32/base 2025-12-04T08:54:02.2810492Z * [new branch] gh/rtimpe/32/head -> origin/gh/rtimpe/32/head 2025-12-04T08:54:02.2810559Z * [new branch] gh/rtimpe/32/orig -> origin/gh/rtimpe/32/orig 2025-12-04T08:54:02.2810624Z * [new branch] gh/rtimpe/33/base -> origin/gh/rtimpe/33/base 2025-12-04T08:54:02.2810691Z * [new branch] gh/rtimpe/33/head -> origin/gh/rtimpe/33/head 2025-12-04T08:54:02.2810757Z * [new branch] gh/rtimpe/33/orig -> origin/gh/rtimpe/33/orig 2025-12-04T08:54:02.2810822Z * [new branch] gh/rtimpe/34/base -> origin/gh/rtimpe/34/base 2025-12-04T08:54:02.2810886Z * [new branch] gh/rtimpe/34/head -> origin/gh/rtimpe/34/head 2025-12-04T08:54:02.2810953Z * [new branch] gh/rtimpe/34/orig -> origin/gh/rtimpe/34/orig 2025-12-04T08:54:02.2811018Z * [new branch] gh/rtimpe/35/base -> origin/gh/rtimpe/35/base 2025-12-04T08:54:02.2811087Z * [new branch] gh/rtimpe/35/head -> origin/gh/rtimpe/35/head 2025-12-04T08:54:02.2811152Z * [new branch] gh/rtimpe/35/orig -> origin/gh/rtimpe/35/orig 2025-12-04T08:54:02.2811217Z * [new branch] gh/rtimpe/4/base -> origin/gh/rtimpe/4/base 2025-12-04T08:54:02.2811286Z * [new branch] gh/rtimpe/4/head -> origin/gh/rtimpe/4/head 2025-12-04T08:54:02.2811366Z * [new branch] gh/ruisizhang123/1/base -> origin/gh/ruisizhang123/1/base 2025-12-04T08:54:02.2811443Z * [new branch] gh/ruisizhang123/1/head -> origin/gh/ruisizhang123/1/head 2025-12-04T08:54:02.2811519Z * [new branch] gh/ruisizhang123/1/orig -> origin/gh/ruisizhang123/1/orig 2025-12-04T08:54:02.2811595Z * [new branch] gh/ruisizhang123/4/base -> origin/gh/ruisizhang123/4/base 2025-12-04T08:54:02.2811670Z * [new branch] gh/ruisizhang123/4/head -> origin/gh/ruisizhang123/4/head 2025-12-04T08:54:02.2811771Z * [new branch] gh/ruisizhang123/4/orig -> origin/gh/ruisizhang123/4/orig 2025-12-04T08:54:02.2811846Z * [new branch] gh/ruisizhang123/5/base -> origin/gh/ruisizhang123/5/base 
2025-12-04T08:54:02.2811921Z * [new branch] gh/ruisizhang123/5/head -> origin/gh/ruisizhang123/5/head 2025-12-04T08:54:02.2812000Z * [new branch] gh/ruisizhang123/5/orig -> origin/gh/ruisizhang123/5/orig 2025-12-04T08:54:02.2812074Z * [new branch] gh/ruisizhang123/6/base -> origin/gh/ruisizhang123/6/base 2025-12-04T08:54:02.2812148Z * [new branch] gh/ruisizhang123/6/head -> origin/gh/ruisizhang123/6/head 2025-12-04T08:54:02.2812224Z * [new branch] gh/ruisizhang123/6/orig -> origin/gh/ruisizhang123/6/orig 2025-12-04T08:54:02.2812299Z * [new branch] gh/ruisizhang123/7/base -> origin/gh/ruisizhang123/7/base 2025-12-04T08:54:02.2812376Z * [new branch] gh/ruisizhang123/7/head -> origin/gh/ruisizhang123/7/head 2025-12-04T08:54:02.2812451Z * [new branch] gh/ruisizhang123/7/orig -> origin/gh/ruisizhang123/7/orig 2025-12-04T08:54:02.2812524Z * [new branch] gh/ruisizhang123/8/base -> origin/gh/ruisizhang123/8/base 2025-12-04T08:54:02.2812599Z * [new branch] gh/ruisizhang123/8/head -> origin/gh/ruisizhang123/8/head 2025-12-04T08:54:02.2812710Z * [new branch] gh/ruisizhang123/8/orig -> origin/gh/ruisizhang123/8/orig 2025-12-04T08:54:02.2812785Z * [new branch] gh/ruisizhang123/9/base -> origin/gh/ruisizhang123/9/base 2025-12-04T08:54:02.2812860Z * [new branch] gh/ruisizhang123/9/head -> origin/gh/ruisizhang123/9/head 2025-12-04T08:54:02.2812933Z * [new branch] gh/ruisizhang123/9/orig -> origin/gh/ruisizhang123/9/orig 2025-12-04T08:54:02.2813010Z * [new branch] gh/seemethere/52/base -> origin/gh/seemethere/52/base 2025-12-04T08:54:02.2813086Z * [new branch] gh/seemethere/52/head -> origin/gh/seemethere/52/head 2025-12-04T08:54:02.2813160Z * [new branch] gh/seemethere/52/orig -> origin/gh/seemethere/52/orig 2025-12-04T08:54:02.2813233Z * [new branch] gh/seemethere/53/base -> origin/gh/seemethere/53/base 2025-12-04T08:54:02.2813308Z * [new branch] gh/seemethere/53/head -> origin/gh/seemethere/53/head 2025-12-04T08:54:02.2813380Z * [new branch] gh/seemethere/53/orig -> 
origin/gh/seemethere/53/orig 2025-12-04T08:54:02.2813451Z * [new branch] gh/seemethere/54/base -> origin/gh/seemethere/54/base 2025-12-04T08:54:02.2813524Z * [new branch] gh/seemethere/54/head -> origin/gh/seemethere/54/head 2025-12-04T08:54:02.2813594Z * [new branch] gh/seemethere/54/orig -> origin/gh/seemethere/54/orig 2025-12-04T08:54:02.2813666Z * [new branch] gh/seemethere/55/base -> origin/gh/seemethere/55/base 2025-12-04T08:54:02.2813741Z * [new branch] gh/seemethere/55/head -> origin/gh/seemethere/55/head 2025-12-04T08:54:02.2813813Z * [new branch] gh/seemethere/55/orig -> origin/gh/seemethere/55/orig 2025-12-04T08:54:02.2813885Z * [new branch] gh/seemethere/59/base -> origin/gh/seemethere/59/base 2025-12-04T08:54:02.2813959Z * [new branch] gh/seemethere/59/head -> origin/gh/seemethere/59/head 2025-12-04T08:54:02.2814030Z * [new branch] gh/seemethere/59/orig -> origin/gh/seemethere/59/orig 2025-12-04T08:54:02.2814104Z * [new branch] gh/seemethere/62/base -> origin/gh/seemethere/62/base 2025-12-04T08:54:02.2814176Z * [new branch] gh/seemethere/62/head -> origin/gh/seemethere/62/head 2025-12-04T08:54:02.2814247Z * [new branch] gh/seemethere/62/orig -> origin/gh/seemethere/62/orig 2025-12-04T08:54:02.2814323Z * [new branch] gh/seemethere/63/base -> origin/gh/seemethere/63/base 2025-12-04T08:54:02.2814420Z * [new branch] gh/seemethere/63/head -> origin/gh/seemethere/63/head 2025-12-04T08:54:02.2814492Z * [new branch] gh/seemethere/63/orig -> origin/gh/seemethere/63/orig 2025-12-04T08:54:02.2814567Z * [new branch] gh/seemethere/71/base -> origin/gh/seemethere/71/base 2025-12-04T08:54:02.2814641Z * [new branch] gh/seemethere/71/head -> origin/gh/seemethere/71/head 2025-12-04T08:54:02.2814714Z * [new branch] gh/seemethere/71/orig -> origin/gh/seemethere/71/orig 2025-12-04T08:54:02.2814789Z * [new branch] gh/seemethere/72/base -> origin/gh/seemethere/72/base 2025-12-04T08:54:02.2814861Z * [new branch] gh/seemethere/72/head -> origin/gh/seemethere/72/head 
2025-12-04T08:54:02.2814932Z * [new branch] gh/seemethere/72/orig -> origin/gh/seemethere/72/orig 2025-12-04T08:54:02.2815007Z * [new branch] gh/seemethere/73/base -> origin/gh/seemethere/73/base 2025-12-04T08:54:02.2815080Z * [new branch] gh/seemethere/73/head -> origin/gh/seemethere/73/head 2025-12-04T08:54:02.2815152Z * [new branch] gh/seemethere/73/orig -> origin/gh/seemethere/73/orig 2025-12-04T08:54:02.2815223Z * [new branch] gh/seemethere/74/base -> origin/gh/seemethere/74/base 2025-12-04T08:54:02.2815325Z * [new branch] gh/seemethere/74/head -> origin/gh/seemethere/74/head 2025-12-04T08:54:02.2815396Z * [new branch] gh/seemethere/74/orig -> origin/gh/seemethere/74/orig 2025-12-04T08:54:02.2815470Z * [new branch] gh/seemethere/75/base -> origin/gh/seemethere/75/base 2025-12-04T08:54:02.2815541Z * [new branch] gh/seemethere/75/head -> origin/gh/seemethere/75/head 2025-12-04T08:54:02.2815613Z * [new branch] gh/seemethere/75/orig -> origin/gh/seemethere/75/orig 2025-12-04T08:54:02.2815691Z * [new branch] gh/seemethere/76/base -> origin/gh/seemethere/76/base 2025-12-04T08:54:02.2815763Z * [new branch] gh/seemethere/76/head -> origin/gh/seemethere/76/head 2025-12-04T08:54:02.2815835Z * [new branch] gh/seemethere/76/orig -> origin/gh/seemethere/76/orig 2025-12-04T08:54:02.2815910Z * [new branch] gh/shunting314/145/base -> origin/gh/shunting314/145/base 2025-12-04T08:54:02.2816024Z * [new branch] gh/shunting314/145/head -> origin/gh/shunting314/145/head 2025-12-04T08:54:02.2816103Z * [new branch] gh/shunting314/145/orig -> origin/gh/shunting314/145/orig 2025-12-04T08:54:02.2816179Z * [new branch] gh/shunting314/176/base -> origin/gh/shunting314/176/base 2025-12-04T08:54:02.2816254Z * [new branch] gh/shunting314/176/head -> origin/gh/shunting314/176/head 2025-12-04T08:54:02.2816328Z * [new branch] gh/shunting314/176/orig -> origin/gh/shunting314/176/orig 2025-12-04T08:54:02.2816404Z * [new branch] gh/shunting314/249/base -> origin/gh/shunting314/249/base 
2025-12-04T08:54:02.2816477Z * [new branch] gh/shunting314/249/head -> origin/gh/shunting314/249/head 2025-12-04T08:54:02.2816552Z * [new branch] gh/shunting314/249/orig -> origin/gh/shunting314/249/orig 2025-12-04T08:54:02.2816627Z * [new branch] gh/shunting314/253/base -> origin/gh/shunting314/253/base 2025-12-04T08:54:02.2816701Z * [new branch] gh/shunting314/253/head -> origin/gh/shunting314/253/head 2025-12-04T08:54:02.2816777Z * [new branch] gh/shunting314/253/orig -> origin/gh/shunting314/253/orig 2025-12-04T08:54:02.2816849Z * [new branch] gh/shunting314/256/base -> origin/gh/shunting314/256/base 2025-12-04T08:54:02.2816922Z * [new branch] gh/shunting314/256/head -> origin/gh/shunting314/256/head 2025-12-04T08:54:02.2816996Z * [new branch] gh/shunting314/256/orig -> origin/gh/shunting314/256/orig 2025-12-04T08:54:02.2817119Z * [new branch] gh/shunting314/257/base -> origin/gh/shunting314/257/base 2025-12-04T08:54:02.2817193Z * [new branch] gh/shunting314/257/head -> origin/gh/shunting314/257/head 2025-12-04T08:54:02.2817269Z * [new branch] gh/shunting314/257/orig -> origin/gh/shunting314/257/orig 2025-12-04T08:54:02.2817344Z * [new branch] gh/shunting314/258/base -> origin/gh/shunting314/258/base 2025-12-04T08:54:02.2817420Z * [new branch] gh/shunting314/258/head -> origin/gh/shunting314/258/head 2025-12-04T08:54:02.2817494Z * [new branch] gh/shunting314/258/orig -> origin/gh/shunting314/258/orig 2025-12-04T08:54:02.2817568Z * [new branch] gh/shunting314/259/base -> origin/gh/shunting314/259/base 2025-12-04T08:54:02.2817645Z * [new branch] gh/shunting314/259/head -> origin/gh/shunting314/259/head 2025-12-04T08:54:02.2817717Z * [new branch] gh/shunting314/259/orig -> origin/gh/shunting314/259/orig 2025-12-04T08:54:02.2817793Z * [new branch] gh/shunting314/260/base -> origin/gh/shunting314/260/base 2025-12-04T08:54:02.2817869Z * [new branch] gh/shunting314/260/head -> origin/gh/shunting314/260/head 2025-12-04T08:54:02.2817941Z * [new branch] 
gh/shunting314/260/orig -> origin/gh/shunting314/260/orig 2025-12-04T08:54:02.2818059Z * [new branch] gh/shunting314/261/base -> origin/gh/shunting314/261/base 2025-12-04T08:54:02.2818134Z * [new branch] gh/shunting314/261/head -> origin/gh/shunting314/261/head 2025-12-04T08:54:02.2818208Z * [new branch] gh/shunting314/261/orig -> origin/gh/shunting314/261/orig 2025-12-04T08:54:02.2818283Z * [new branch] gh/shunting314/262/base -> origin/gh/shunting314/262/base 2025-12-04T08:54:02.2818357Z * [new branch] gh/shunting314/262/head -> origin/gh/shunting314/262/head 2025-12-04T08:54:02.2818431Z * [new branch] gh/shunting314/262/orig -> origin/gh/shunting314/262/orig 2025-12-04T08:54:02.2818505Z * [new branch] gh/shunting314/263/base -> origin/gh/shunting314/263/base 2025-12-04T08:54:02.2818579Z * [new branch] gh/shunting314/263/head -> origin/gh/shunting314/263/head 2025-12-04T08:54:02.2818652Z * [new branch] gh/shunting314/263/orig -> origin/gh/shunting314/263/orig 2025-12-04T08:54:02.2818725Z * [new branch] gh/shunting314/264/base -> origin/gh/shunting314/264/base 2025-12-04T08:54:02.2818800Z * [new branch] gh/shunting314/264/head -> origin/gh/shunting314/264/head 2025-12-04T08:54:02.2818873Z * [new branch] gh/shunting314/264/orig -> origin/gh/shunting314/264/orig 2025-12-04T08:54:02.2818946Z * [new branch] gh/shunting314/265/base -> origin/gh/shunting314/265/base 2025-12-04T08:54:02.2819019Z * [new branch] gh/shunting314/265/head -> origin/gh/shunting314/265/head 2025-12-04T08:54:02.2819095Z * [new branch] gh/shunting314/265/orig -> origin/gh/shunting314/265/orig 2025-12-04T08:54:02.2819170Z * [new branch] gh/shunting314/266/base -> origin/gh/shunting314/266/base 2025-12-04T08:54:02.2819245Z * [new branch] gh/shunting314/266/head -> origin/gh/shunting314/266/head 2025-12-04T08:54:02.2819320Z * [new branch] gh/shunting314/266/orig -> origin/gh/shunting314/266/orig 2025-12-04T08:54:02.2819395Z * [new branch] gh/shunting314/267/base -> origin/gh/shunting314/267/base 
2025-12-04T08:54:02.2819468Z * [new branch] gh/shunting314/267/head -> origin/gh/shunting314/267/head 2025-12-04T08:54:02.2819542Z * [new branch] gh/shunting314/267/orig -> origin/gh/shunting314/267/orig 2025-12-04T08:54:02.2819620Z * [new branch] gh/shunting314/268/base -> origin/gh/shunting314/268/base 2025-12-04T08:54:02.2819695Z * [new branch] gh/shunting314/268/head -> origin/gh/shunting314/268/head 2025-12-04T08:54:02.2819799Z * [new branch] gh/shunting314/268/orig -> origin/gh/shunting314/268/orig 2025-12-04T08:54:02.2819875Z * [new branch] gh/shunting314/269/base -> origin/gh/shunting314/269/base 2025-12-04T08:54:02.2819949Z * [new branch] gh/shunting314/269/head -> origin/gh/shunting314/269/head 2025-12-04T08:54:02.2820024Z * [new branch] gh/shunting314/269/orig -> origin/gh/shunting314/269/orig 2025-12-04T08:54:02.2820099Z * [new branch] gh/silverguo/1/base -> origin/gh/silverguo/1/base 2025-12-04T08:54:02.2820170Z * [new branch] gh/silverguo/1/head -> origin/gh/silverguo/1/head 2025-12-04T08:54:02.2820241Z * [new branch] gh/silverguo/2/base -> origin/gh/silverguo/2/base 2025-12-04T08:54:02.2820314Z * [new branch] gh/silverguo/2/head -> origin/gh/silverguo/2/head 2025-12-04T08:54:02.2820386Z * [new branch] gh/silverguo/3/base -> origin/gh/silverguo/3/base 2025-12-04T08:54:02.2820454Z * [new branch] gh/silverguo/3/head -> origin/gh/silverguo/3/head 2025-12-04T08:54:02.2820523Z * [new branch] gh/silverguo/4/base -> origin/gh/silverguo/4/base 2025-12-04T08:54:02.2820594Z * [new branch] gh/silverguo/4/head -> origin/gh/silverguo/4/head 2025-12-04T08:54:02.2820698Z * [new branch] gh/slayton58/39/base -> origin/gh/slayton58/39/base 2025-12-04T08:54:02.2820770Z * [new branch] gh/slayton58/39/head -> origin/gh/slayton58/39/head 2025-12-04T08:54:02.2820839Z * [new branch] gh/slayton58/39/orig -> origin/gh/slayton58/39/orig 2025-12-04T08:54:02.2820912Z * [new branch] gh/slayton58/42/base -> origin/gh/slayton58/42/base 2025-12-04T08:54:02.2820980Z * [new branch] 
gh/slayton58/42/head -> origin/gh/slayton58/42/head 2025-12-04T08:54:02.2821049Z * [new branch] gh/slayton58/42/orig -> origin/gh/slayton58/42/orig 2025-12-04T08:54:02.2821120Z * [new branch] gh/slayton58/43/base -> origin/gh/slayton58/43/base 2025-12-04T08:54:02.2821188Z * [new branch] gh/slayton58/43/head -> origin/gh/slayton58/43/head 2025-12-04T08:54:02.2821258Z * [new branch] gh/slayton58/43/orig -> origin/gh/slayton58/43/orig 2025-12-04T08:54:02.2821331Z * [new branch] gh/slayton58/44/base -> origin/gh/slayton58/44/base 2025-12-04T08:54:02.2821400Z * [new branch] gh/slayton58/44/head -> origin/gh/slayton58/44/head 2025-12-04T08:54:02.2821468Z * [new branch] gh/slayton58/44/orig -> origin/gh/slayton58/44/orig 2025-12-04T08:54:02.2821540Z * [new branch] gh/slayton58/45/base -> origin/gh/slayton58/45/base 2025-12-04T08:54:02.2821610Z * [new branch] gh/slayton58/45/head -> origin/gh/slayton58/45/head 2025-12-04T08:54:02.2821681Z * [new branch] gh/slayton58/45/orig -> origin/gh/slayton58/45/orig 2025-12-04T08:54:02.2821755Z * [new branch] gh/slayton58/46/base -> origin/gh/slayton58/46/base 2025-12-04T08:54:02.2821825Z * [new branch] gh/slayton58/46/head -> origin/gh/slayton58/46/head 2025-12-04T08:54:02.2821895Z * [new branch] gh/slayton58/46/orig -> origin/gh/slayton58/46/orig 2025-12-04T08:54:02.2821966Z * [new branch] gh/slayton58/6/base -> origin/gh/slayton58/6/base 2025-12-04T08:54:02.2822035Z * [new branch] gh/slayton58/6/head -> origin/gh/slayton58/6/head 2025-12-04T08:54:02.2822106Z * [new branch] gh/slayton58/7/base -> origin/gh/slayton58/7/base 2025-12-04T08:54:02.2822175Z * [new branch] gh/slayton58/7/head -> origin/gh/slayton58/7/head 2025-12-04T08:54:02.2822249Z * [new branch] gh/soulitzer/269/base -> origin/gh/soulitzer/269/base 2025-12-04T08:54:02.2822352Z * [new branch] gh/soulitzer/269/head -> origin/gh/soulitzer/269/head 2025-12-04T08:54:02.2822422Z * [new branch] gh/soulitzer/269/orig -> origin/gh/soulitzer/269/orig 2025-12-04T08:54:02.2822495Z 
* [new branch] gh/soulitzer/276/base -> origin/gh/soulitzer/276/base 2025-12-04T08:54:02.2822567Z * [new branch] gh/soulitzer/276/head -> origin/gh/soulitzer/276/head 2025-12-04T08:54:02.2822639Z * [new branch] gh/soulitzer/276/orig -> origin/gh/soulitzer/276/orig 2025-12-04T08:54:02.2822710Z * [new branch] gh/soulitzer/287/base -> origin/gh/soulitzer/287/base 2025-12-04T08:54:02.2822782Z * [new branch] gh/soulitzer/287/head -> origin/gh/soulitzer/287/head 2025-12-04T08:54:02.2822853Z * [new branch] gh/soulitzer/287/orig -> origin/gh/soulitzer/287/orig 2025-12-04T08:54:02.2822925Z * [new branch] gh/soulitzer/296/base -> origin/gh/soulitzer/296/base 2025-12-04T08:54:02.2822998Z * [new branch] gh/soulitzer/296/head -> origin/gh/soulitzer/296/head 2025-12-04T08:54:02.2823069Z * [new branch] gh/soulitzer/296/orig -> origin/gh/soulitzer/296/orig 2025-12-04T08:54:02.2823140Z * [new branch] gh/soulitzer/299/base -> origin/gh/soulitzer/299/base 2025-12-04T08:54:02.2823255Z * [new branch] gh/soulitzer/299/head -> origin/gh/soulitzer/299/head 2025-12-04T08:54:02.2823327Z * [new branch] gh/soulitzer/299/orig -> origin/gh/soulitzer/299/orig 2025-12-04T08:54:02.2823398Z * [new branch] gh/soulitzer/300/base -> origin/gh/soulitzer/300/base 2025-12-04T08:54:02.2823472Z * [new branch] gh/soulitzer/300/head -> origin/gh/soulitzer/300/head 2025-12-04T08:54:02.2823545Z * [new branch] gh/soulitzer/300/orig -> origin/gh/soulitzer/300/orig 2025-12-04T08:54:02.2823617Z * [new branch] gh/soulitzer/301/base -> origin/gh/soulitzer/301/base 2025-12-04T08:54:02.2823692Z * [new branch] gh/soulitzer/301/head -> origin/gh/soulitzer/301/head 2025-12-04T08:54:02.2823764Z * [new branch] gh/soulitzer/301/orig -> origin/gh/soulitzer/301/orig 2025-12-04T08:54:02.2823835Z * [new branch] gh/soulitzer/313/base -> origin/gh/soulitzer/313/base 2025-12-04T08:54:02.2823907Z * [new branch] gh/soulitzer/313/head -> origin/gh/soulitzer/313/head 2025-12-04T08:54:02.2823977Z * [new branch] gh/soulitzer/313/orig -> 
origin/gh/soulitzer/313/orig 2025-12-04T08:54:02.2824048Z * [new branch] gh/soulitzer/319/base -> origin/gh/soulitzer/319/base 2025-12-04T08:54:02.2824119Z * [new branch] gh/soulitzer/319/head -> origin/gh/soulitzer/319/head 2025-12-04T08:54:02.2824190Z * [new branch] gh/soulitzer/319/orig -> origin/gh/soulitzer/319/orig 2025-12-04T08:54:02.2824262Z * [new branch] gh/soulitzer/320/base -> origin/gh/soulitzer/320/base 2025-12-04T08:54:02.2824334Z * [new branch] gh/soulitzer/320/head -> origin/gh/soulitzer/320/head 2025-12-04T08:54:02.2824406Z * [new branch] gh/soulitzer/320/orig -> origin/gh/soulitzer/320/orig 2025-12-04T08:54:02.2824481Z * [new branch] gh/soulitzer/336/base -> origin/gh/soulitzer/336/base 2025-12-04T08:54:02.2824554Z * [new branch] gh/soulitzer/336/head -> origin/gh/soulitzer/336/head 2025-12-04T08:54:02.2824626Z * [new branch] gh/soulitzer/336/orig -> origin/gh/soulitzer/336/orig 2025-12-04T08:54:02.2824699Z * [new branch] gh/soulitzer/347/base -> origin/gh/soulitzer/347/base 2025-12-04T08:54:02.2824770Z * [new branch] gh/soulitzer/347/head -> origin/gh/soulitzer/347/head 2025-12-04T08:54:02.2824841Z * [new branch] gh/soulitzer/347/orig -> origin/gh/soulitzer/347/orig 2025-12-04T08:54:02.2824941Z * [new branch] gh/soulitzer/349/base -> origin/gh/soulitzer/349/base 2025-12-04T08:54:02.2825013Z * [new branch] gh/soulitzer/349/head -> origin/gh/soulitzer/349/head 2025-12-04T08:54:02.2825083Z * [new branch] gh/soulitzer/349/orig -> origin/gh/soulitzer/349/orig 2025-12-04T08:54:02.2825155Z * [new branch] gh/soulitzer/350/base -> origin/gh/soulitzer/350/base 2025-12-04T08:54:02.2825226Z * [new branch] gh/soulitzer/350/head -> origin/gh/soulitzer/350/head 2025-12-04T08:54:02.2825299Z * [new branch] gh/soulitzer/350/orig -> origin/gh/soulitzer/350/orig 2025-12-04T08:54:02.2825370Z * [new branch] gh/soulitzer/351/base -> origin/gh/soulitzer/351/base 2025-12-04T08:54:02.2825439Z * [new branch] gh/soulitzer/351/head -> origin/gh/soulitzer/351/head 
2025-12-04T08:54:02.2825512Z * [new branch] gh/soulitzer/351/orig -> origin/gh/soulitzer/351/orig 2025-12-04T08:54:02.2825585Z * [new branch] gh/soulitzer/353/base -> origin/gh/soulitzer/353/base 2025-12-04T08:54:02.2825656Z * [new branch] gh/soulitzer/353/head -> origin/gh/soulitzer/353/head 2025-12-04T08:54:02.2825731Z * [new branch] gh/soulitzer/353/orig -> origin/gh/soulitzer/353/orig 2025-12-04T08:54:02.2825801Z * [new branch] gh/soulitzer/358/base -> origin/gh/soulitzer/358/base 2025-12-04T08:54:02.2825903Z * [new branch] gh/soulitzer/358/head -> origin/gh/soulitzer/358/head 2025-12-04T08:54:02.2826013Z * [new branch] gh/soulitzer/358/orig -> origin/gh/soulitzer/358/orig 2025-12-04T08:54:02.2826085Z * [new branch] gh/soulitzer/359/base -> origin/gh/soulitzer/359/base 2025-12-04T08:54:02.2826156Z * [new branch] gh/soulitzer/359/head -> origin/gh/soulitzer/359/head 2025-12-04T08:54:02.2826229Z * [new branch] gh/soulitzer/359/orig -> origin/gh/soulitzer/359/orig 2025-12-04T08:54:02.2826301Z * [new branch] gh/soulitzer/374/base -> origin/gh/soulitzer/374/base 2025-12-04T08:54:02.2826371Z * [new branch] gh/soulitzer/374/head -> origin/gh/soulitzer/374/head 2025-12-04T08:54:02.2826444Z * [new branch] gh/soulitzer/374/orig -> origin/gh/soulitzer/374/orig 2025-12-04T08:54:02.2826516Z * [new branch] gh/soulitzer/375/base -> origin/gh/soulitzer/375/base 2025-12-04T08:54:02.2826588Z * [new branch] gh/soulitzer/375/head -> origin/gh/soulitzer/375/head 2025-12-04T08:54:02.2826662Z * [new branch] gh/soulitzer/375/orig -> origin/gh/soulitzer/375/orig 2025-12-04T08:54:02.2826732Z * [new branch] gh/soulitzer/380/base -> origin/gh/soulitzer/380/base 2025-12-04T08:54:02.2826806Z * [new branch] gh/soulitzer/380/head -> origin/gh/soulitzer/380/head 2025-12-04T08:54:02.2826878Z * [new branch] gh/soulitzer/380/orig -> origin/gh/soulitzer/380/orig 2025-12-04T08:54:02.2826950Z * [new branch] gh/soulitzer/385/base -> origin/gh/soulitzer/385/base 2025-12-04T08:54:02.2827024Z * [new 
branch] gh/soulitzer/385/head -> origin/gh/soulitzer/385/head 2025-12-04T08:54:02.2827096Z * [new branch] gh/soulitzer/385/orig -> origin/gh/soulitzer/385/orig 2025-12-04T08:54:02.2827168Z * [new branch] gh/soulitzer/386/base -> origin/gh/soulitzer/386/base 2025-12-04T08:54:02.2827243Z * [new branch] gh/soulitzer/386/head -> origin/gh/soulitzer/386/head 2025-12-04T08:54:02.2827315Z * [new branch] gh/soulitzer/386/orig -> origin/gh/soulitzer/386/orig 2025-12-04T08:54:02.2827386Z * [new branch] gh/soulitzer/387/base -> origin/gh/soulitzer/387/base 2025-12-04T08:54:02.2827460Z * [new branch] gh/soulitzer/387/head -> origin/gh/soulitzer/387/head 2025-12-04T08:54:02.2827530Z * [new branch] gh/soulitzer/387/orig -> origin/gh/soulitzer/387/orig 2025-12-04T08:54:02.2827656Z * [new branch] gh/soulitzer/388/base -> origin/gh/soulitzer/388/base 2025-12-04T08:54:02.2827731Z * [new branch] gh/soulitzer/388/head -> origin/gh/soulitzer/388/head 2025-12-04T08:54:02.2827803Z * [new branch] gh/soulitzer/388/orig -> origin/gh/soulitzer/388/orig 2025-12-04T08:54:02.2827876Z * [new branch] gh/soulitzer/389/base -> origin/gh/soulitzer/389/base 2025-12-04T08:54:02.2827950Z * [new branch] gh/soulitzer/389/head -> origin/gh/soulitzer/389/head 2025-12-04T08:54:02.2828023Z * [new branch] gh/soulitzer/389/orig -> origin/gh/soulitzer/389/orig 2025-12-04T08:54:02.2828096Z * [new branch] gh/soulitzer/390/base -> origin/gh/soulitzer/390/base 2025-12-04T08:54:02.2828167Z * [new branch] gh/soulitzer/390/head -> origin/gh/soulitzer/390/head 2025-12-04T08:54:02.2828238Z * [new branch] gh/soulitzer/390/orig -> origin/gh/soulitzer/390/orig 2025-12-04T08:54:02.2828311Z * [new branch] gh/soulitzer/391/base -> origin/gh/soulitzer/391/base 2025-12-04T08:54:02.2828386Z * [new branch] gh/soulitzer/391/head -> origin/gh/soulitzer/391/head 2025-12-04T08:54:02.2828456Z * [new branch] gh/soulitzer/391/orig -> origin/gh/soulitzer/391/orig 2025-12-04T08:54:02.2828574Z * [new branch] gh/soulitzer/392/base -> 
origin/gh/soulitzer/392/base 2025-12-04T08:54:02.2828647Z * [new branch] gh/soulitzer/392/head -> origin/gh/soulitzer/392/head 2025-12-04T08:54:02.2828719Z * [new branch] gh/soulitzer/392/orig -> origin/gh/soulitzer/392/orig 2025-12-04T08:54:02.2828791Z * [new branch] gh/swolchok/728/next -> origin/gh/swolchok/728/next 2025-12-04T08:54:02.2828863Z * [new branch] gh/swolchok/819/base -> origin/gh/swolchok/819/base 2025-12-04T08:54:02.2828935Z * [new branch] gh/swolchok/819/head -> origin/gh/swolchok/819/head 2025-12-04T08:54:02.2829007Z * [new branch] gh/swolchok/819/orig -> origin/gh/swolchok/819/orig 2025-12-04T08:54:02.2829076Z * [new branch] gh/swolchok/824/base -> origin/gh/swolchok/824/base 2025-12-04T08:54:02.2829145Z * [new branch] gh/swolchok/824/head -> origin/gh/swolchok/824/head 2025-12-04T08:54:02.2829220Z * [new branch] gh/swolchok/824/orig -> origin/gh/swolchok/824/orig 2025-12-04T08:54:02.2829290Z * [new branch] gh/swolchok/829/base -> origin/gh/swolchok/829/base 2025-12-04T08:54:02.2829358Z * [new branch] gh/swolchok/829/head -> origin/gh/swolchok/829/head 2025-12-04T08:54:02.2829431Z * [new branch] gh/swolchok/829/orig -> origin/gh/swolchok/829/orig 2025-12-04T08:54:02.2829501Z * [new branch] gh/swolchok/839/base -> origin/gh/swolchok/839/base 2025-12-04T08:54:02.2829573Z * [new branch] gh/swolchok/839/head -> origin/gh/swolchok/839/head 2025-12-04T08:54:02.2829646Z * [new branch] gh/swolchok/839/orig -> origin/gh/swolchok/839/orig 2025-12-04T08:54:02.2829715Z * [new branch] gh/swolchok/841/base -> origin/gh/swolchok/841/base 2025-12-04T08:54:02.2829785Z * [new branch] gh/swolchok/841/head -> origin/gh/swolchok/841/head 2025-12-04T08:54:02.2829856Z * [new branch] gh/swolchok/841/orig -> origin/gh/swolchok/841/orig 2025-12-04T08:54:02.2829924Z * [new branch] gh/swolchok/842/base -> origin/gh/swolchok/842/base 2025-12-04T08:54:02.2829997Z * [new branch] gh/swolchok/842/head -> origin/gh/swolchok/842/head 2025-12-04T08:54:02.2830066Z * [new branch] 
gh/swolchok/842/orig -> origin/gh/swolchok/842/orig 2025-12-04T08:54:02.2830138Z * [new branch] gh/swolchok/845/base -> origin/gh/swolchok/845/base 2025-12-04T08:54:02.2830236Z * [new branch] gh/swolchok/845/head -> origin/gh/swolchok/845/head 2025-12-04T08:54:02.2830304Z * [new branch] gh/swolchok/845/orig -> origin/gh/swolchok/845/orig 2025-12-04T08:54:02.2830374Z * [new branch] gh/swolchok/848/base -> origin/gh/swolchok/848/base 2025-12-04T08:54:02.2830443Z * [new branch] gh/swolchok/848/head -> origin/gh/swolchok/848/head 2025-12-04T08:54:02.2830515Z * [new branch] gh/swolchok/848/orig -> origin/gh/swolchok/848/orig 2025-12-04T08:54:02.2830585Z * [new branch] gh/swolchok/856/base -> origin/gh/swolchok/856/base 2025-12-04T08:54:02.2830656Z * [new branch] gh/swolchok/856/head -> origin/gh/swolchok/856/head 2025-12-04T08:54:02.2830727Z * [new branch] gh/swolchok/856/orig -> origin/gh/swolchok/856/orig 2025-12-04T08:54:02.2830798Z * [new branch] gh/swolchok/860/base -> origin/gh/swolchok/860/base 2025-12-04T08:54:02.2830868Z * [new branch] gh/swolchok/860/head -> origin/gh/swolchok/860/head 2025-12-04T08:54:02.2830938Z * [new branch] gh/swolchok/860/orig -> origin/gh/swolchok/860/orig 2025-12-04T08:54:02.2831010Z * [new branch] gh/swolchok/861/base -> origin/gh/swolchok/861/base 2025-12-04T08:54:02.2831081Z * [new branch] gh/swolchok/861/head -> origin/gh/swolchok/861/head 2025-12-04T08:54:02.2831181Z * [new branch] gh/swolchok/861/orig -> origin/gh/swolchok/861/orig 2025-12-04T08:54:02.2831255Z * [new branch] gh/swolchok/862/base -> origin/gh/swolchok/862/base 2025-12-04T08:54:02.2831324Z * [new branch] gh/swolchok/862/head -> origin/gh/swolchok/862/head 2025-12-04T08:54:02.2831394Z * [new branch] gh/swolchok/862/orig -> origin/gh/swolchok/862/orig 2025-12-04T08:54:02.2831464Z * [new branch] gh/swolchok/863/base -> origin/gh/swolchok/863/base 2025-12-04T08:54:02.2831534Z * [new branch] gh/swolchok/863/head -> origin/gh/swolchok/863/head 
2025-12-04T08:54:02.2831604Z * [new branch] gh/swolchok/863/orig -> origin/gh/swolchok/863/orig 2025-12-04T08:54:02.2831674Z * [new branch] gh/swolchok/864/base -> origin/gh/swolchok/864/base 2025-12-04T08:54:02.2831745Z * [new branch] gh/swolchok/864/head -> origin/gh/swolchok/864/head 2025-12-04T08:54:02.2831816Z * [new branch] gh/swolchok/864/orig -> origin/gh/swolchok/864/orig 2025-12-04T08:54:02.2831886Z * [new branch] gh/swolchok/865/base -> origin/gh/swolchok/865/base 2025-12-04T08:54:02.2831957Z * [new branch] gh/swolchok/865/head -> origin/gh/swolchok/865/head 2025-12-04T08:54:02.2832027Z * [new branch] gh/swolchok/865/orig -> origin/gh/swolchok/865/orig 2025-12-04T08:54:02.2832095Z * [new branch] gh/swolchok/866/base -> origin/gh/swolchok/866/base 2025-12-04T08:54:02.2832166Z * [new branch] gh/swolchok/866/head -> origin/gh/swolchok/866/head 2025-12-04T08:54:02.2832237Z * [new branch] gh/swolchok/866/orig -> origin/gh/swolchok/866/orig 2025-12-04T08:54:02.2832306Z * [new branch] gh/swolchok/867/base -> origin/gh/swolchok/867/base 2025-12-04T08:54:02.2832376Z * [new branch] gh/swolchok/867/head -> origin/gh/swolchok/867/head 2025-12-04T08:54:02.2832446Z * [new branch] gh/swolchok/867/orig -> origin/gh/swolchok/867/orig 2025-12-04T08:54:02.2832516Z * [new branch] gh/swolchok/868/base -> origin/gh/swolchok/868/base 2025-12-04T08:54:02.2832584Z * [new branch] gh/swolchok/868/head -> origin/gh/swolchok/868/head 2025-12-04T08:54:02.2832655Z * [new branch] gh/swolchok/868/orig -> origin/gh/swolchok/868/orig 2025-12-04T08:54:02.2832724Z * [new branch] gh/swolchok/869/base -> origin/gh/swolchok/869/base 2025-12-04T08:54:02.2832827Z * [new branch] gh/swolchok/869/head -> origin/gh/swolchok/869/head 2025-12-04T08:54:02.2832896Z * [new branch] gh/swolchok/869/orig -> origin/gh/swolchok/869/orig 2025-12-04T08:54:02.2832964Z * [new branch] gh/swolchok/870/base -> origin/gh/swolchok/870/base 2025-12-04T08:54:02.2833035Z * [new branch] gh/swolchok/870/head -> 
origin/gh/swolchok/870/head 2025-12-04T08:54:02.2833105Z * [new branch] gh/swolchok/870/orig -> origin/gh/swolchok/870/orig 2025-12-04T08:54:02.2833173Z * [new branch] gh/swolchok/871/base -> origin/gh/swolchok/871/base 2025-12-04T08:54:02.2833244Z * [new branch] gh/swolchok/871/head -> origin/gh/swolchok/871/head 2025-12-04T08:54:02.2833313Z * [new branch] gh/swolchok/871/orig -> origin/gh/swolchok/871/orig 2025-12-04T08:54:02.2833385Z * [new branch] gh/teja-rao/4/base -> origin/gh/teja-rao/4/base 2025-12-04T08:54:02.2833459Z * [new branch] gh/teja-rao/4/head -> origin/gh/teja-rao/4/head 2025-12-04T08:54:02.2833527Z * [new branch] gh/teja-rao/4/orig -> origin/gh/teja-rao/4/orig 2025-12-04T08:54:02.2833597Z * [new branch] gh/tianyu-l/2/base -> origin/gh/tianyu-l/2/base 2025-12-04T08:54:02.2833692Z * [new branch] gh/tianyu-l/2/head -> origin/gh/tianyu-l/2/head 2025-12-04T08:54:02.2833760Z * [new branch] gh/tianyu-l/2/orig -> origin/gh/tianyu-l/2/orig 2025-12-04T08:54:02.2833827Z * [new branch] gh/tianyu-l/3/base -> origin/gh/tianyu-l/3/base 2025-12-04T08:54:02.2833896Z * [new branch] gh/tianyu-l/3/orig -> origin/gh/tianyu-l/3/orig 2025-12-04T08:54:02.2833964Z * [new branch] gh/tianyu-l/4/base -> origin/gh/tianyu-l/4/base 2025-12-04T08:54:02.2834031Z * [new branch] gh/tianyu-l/4/head -> origin/gh/tianyu-l/4/head 2025-12-04T08:54:02.2834101Z * [new branch] gh/tianyu-l/4/orig -> origin/gh/tianyu-l/4/orig 2025-12-04T08:54:02.2834190Z * [new branch] gh/tugsbayasgalan/10/base -> origin/gh/tugsbayasgalan/10/base 2025-12-04T08:54:02.2834275Z * [new branch] gh/tugsbayasgalan/10/head -> origin/gh/tugsbayasgalan/10/head 2025-12-04T08:54:02.2834359Z * [new branch] gh/tugsbayasgalan/10/orig -> origin/gh/tugsbayasgalan/10/orig 2025-12-04T08:54:02.2834442Z * [new branch] gh/tugsbayasgalan/13/base -> origin/gh/tugsbayasgalan/13/base 2025-12-04T08:54:02.2834522Z * [new branch] gh/tugsbayasgalan/13/head -> origin/gh/tugsbayasgalan/13/head 2025-12-04T08:54:02.2834603Z * [new branch] 
gh/tugsbayasgalan/13/orig -> origin/gh/tugsbayasgalan/13/orig 2025-12-04T08:54:02.2834683Z * [new branch] gh/tugsbayasgalan/17/base -> origin/gh/tugsbayasgalan/17/base 2025-12-04T08:54:02.2834766Z * [new branch] gh/tugsbayasgalan/17/head -> origin/gh/tugsbayasgalan/17/head 2025-12-04T08:54:02.2834847Z * [new branch] gh/tugsbayasgalan/17/orig -> origin/gh/tugsbayasgalan/17/orig 2025-12-04T08:54:02.2834929Z * [new branch] gh/tugsbayasgalan/2/base -> origin/gh/tugsbayasgalan/2/base 2025-12-04T08:54:02.2835009Z * [new branch] gh/tugsbayasgalan/2/head -> origin/gh/tugsbayasgalan/2/head 2025-12-04T08:54:02.2835089Z * [new branch] gh/tugsbayasgalan/2/orig -> origin/gh/tugsbayasgalan/2/orig 2025-12-04T08:54:02.2835171Z * [new branch] gh/tugsbayasgalan/28/base -> origin/gh/tugsbayasgalan/28/base 2025-12-04T08:54:02.2835252Z * [new branch] gh/tugsbayasgalan/28/head -> origin/gh/tugsbayasgalan/28/head 2025-12-04T08:54:02.2835334Z * [new branch] gh/tugsbayasgalan/28/orig -> origin/gh/tugsbayasgalan/28/orig 2025-12-04T08:54:02.2835414Z * [new branch] gh/tugsbayasgalan/32/base -> origin/gh/tugsbayasgalan/32/base 2025-12-04T08:54:02.2835527Z * [new branch] gh/tugsbayasgalan/32/head -> origin/gh/tugsbayasgalan/32/head 2025-12-04T08:54:02.2835607Z * [new branch] gh/tugsbayasgalan/32/orig -> origin/gh/tugsbayasgalan/32/orig 2025-12-04T08:54:02.2835688Z * [new branch] gh/tugsbayasgalan/35/base -> origin/gh/tugsbayasgalan/35/base 2025-12-04T08:54:02.2835772Z * [new branch] gh/tugsbayasgalan/35/head -> origin/gh/tugsbayasgalan/35/head 2025-12-04T08:54:02.2835853Z * [new branch] gh/tugsbayasgalan/35/orig -> origin/gh/tugsbayasgalan/35/orig 2025-12-04T08:54:02.2835956Z * [new branch] gh/tugsbayasgalan/36/base -> origin/gh/tugsbayasgalan/36/base 2025-12-04T08:54:02.2836037Z * [new branch] gh/tugsbayasgalan/36/head -> origin/gh/tugsbayasgalan/36/head 2025-12-04T08:54:02.2836118Z * [new branch] gh/tugsbayasgalan/36/orig -> origin/gh/tugsbayasgalan/36/orig 2025-12-04T08:54:02.2836202Z * [new 
branch] gh/tugsbayasgalan/37/base -> origin/gh/tugsbayasgalan/37/base 2025-12-04T08:54:02.2836284Z * [new branch] gh/tugsbayasgalan/37/head -> origin/gh/tugsbayasgalan/37/head 2025-12-04T08:54:02.2836365Z * [new branch] gh/tugsbayasgalan/37/orig -> origin/gh/tugsbayasgalan/37/orig 2025-12-04T08:54:02.2836488Z * [new branch] gh/tugsbayasgalan/43/base -> origin/gh/tugsbayasgalan/43/base 2025-12-04T08:54:02.2836569Z * [new branch] gh/tugsbayasgalan/43/head -> origin/gh/tugsbayasgalan/43/head 2025-12-04T08:54:02.2836650Z * [new branch] gh/tugsbayasgalan/43/orig -> origin/gh/tugsbayasgalan/43/orig 2025-12-04T08:54:02.2836731Z * [new branch] gh/tugsbayasgalan/48/base -> origin/gh/tugsbayasgalan/48/base 2025-12-04T08:54:02.2836813Z * [new branch] gh/tugsbayasgalan/48/head -> origin/gh/tugsbayasgalan/48/head 2025-12-04T08:54:02.2836893Z * [new branch] gh/tugsbayasgalan/48/orig -> origin/gh/tugsbayasgalan/48/orig 2025-12-04T08:54:02.2836976Z * [new branch] gh/tugsbayasgalan/51/base -> origin/gh/tugsbayasgalan/51/base 2025-12-04T08:54:02.2837056Z * [new branch] gh/tugsbayasgalan/51/head -> origin/gh/tugsbayasgalan/51/head 2025-12-04T08:54:02.2837136Z * [new branch] gh/tugsbayasgalan/51/orig -> origin/gh/tugsbayasgalan/51/orig 2025-12-04T08:54:02.2837220Z * [new branch] gh/tugsbayasgalan/52/base -> origin/gh/tugsbayasgalan/52/base 2025-12-04T08:54:02.2837300Z * [new branch] gh/tugsbayasgalan/52/head -> origin/gh/tugsbayasgalan/52/head 2025-12-04T08:54:02.2837380Z * [new branch] gh/tugsbayasgalan/52/orig -> origin/gh/tugsbayasgalan/52/orig 2025-12-04T08:54:02.2837461Z * [new branch] gh/tugsbayasgalan/53/base -> origin/gh/tugsbayasgalan/53/base 2025-12-04T08:54:02.2837541Z * [new branch] gh/tugsbayasgalan/53/head -> origin/gh/tugsbayasgalan/53/head 2025-12-04T08:54:02.2837623Z * [new branch] gh/tugsbayasgalan/53/orig -> origin/gh/tugsbayasgalan/53/orig 2025-12-04T08:54:02.2837704Z * [new branch] gh/tugsbayasgalan/55/base -> origin/gh/tugsbayasgalan/55/base 
2025-12-04T08:54:02.2837784Z * [new branch] gh/tugsbayasgalan/55/head -> origin/gh/tugsbayasgalan/55/head 2025-12-04T08:54:02.2837866Z * [new branch] gh/tugsbayasgalan/55/orig -> origin/gh/tugsbayasgalan/55/orig 2025-12-04T08:54:02.2837947Z * [new branch] gh/tugsbayasgalan/59/base -> origin/gh/tugsbayasgalan/59/base 2025-12-04T08:54:02.2838027Z * [new branch] gh/tugsbayasgalan/59/head -> origin/gh/tugsbayasgalan/59/head 2025-12-04T08:54:02.2838108Z * [new branch] gh/tugsbayasgalan/59/orig -> origin/gh/tugsbayasgalan/59/orig 2025-12-04T08:54:02.2838188Z * [new branch] gh/tugsbayasgalan/6/base -> origin/gh/tugsbayasgalan/6/base 2025-12-04T08:54:02.2838308Z * [new branch] gh/tugsbayasgalan/6/head -> origin/gh/tugsbayasgalan/6/head 2025-12-04T08:54:02.2838387Z * [new branch] gh/tugsbayasgalan/6/orig -> origin/gh/tugsbayasgalan/6/orig 2025-12-04T08:54:02.2838468Z * [new branch] gh/tugsbayasgalan/60/base -> origin/gh/tugsbayasgalan/60/base 2025-12-04T08:54:02.2838548Z * [new branch] gh/tugsbayasgalan/60/head -> origin/gh/tugsbayasgalan/60/head 2025-12-04T08:54:02.2838632Z * [new branch] gh/tugsbayasgalan/60/orig -> origin/gh/tugsbayasgalan/60/orig 2025-12-04T08:54:02.2838712Z * [new branch] gh/tugsbayasgalan/61/base -> origin/gh/tugsbayasgalan/61/base 2025-12-04T08:54:02.2838795Z * [new branch] gh/tugsbayasgalan/61/head -> origin/gh/tugsbayasgalan/61/head 2025-12-04T08:54:02.2838877Z * [new branch] gh/tugsbayasgalan/61/orig -> origin/gh/tugsbayasgalan/61/orig 2025-12-04T08:54:02.2838957Z * [new branch] gh/tugsbayasgalan/63/base -> origin/gh/tugsbayasgalan/63/base 2025-12-04T08:54:02.2839039Z * [new branch] gh/tugsbayasgalan/63/head -> origin/gh/tugsbayasgalan/63/head 2025-12-04T08:54:02.2839120Z * [new branch] gh/tugsbayasgalan/63/orig -> origin/gh/tugsbayasgalan/63/orig 2025-12-04T08:54:02.2839202Z * [new branch] gh/tugsbayasgalan/67/base -> origin/gh/tugsbayasgalan/67/base 2025-12-04T08:54:02.2839313Z * [new branch] gh/tugsbayasgalan/67/head -> 
origin/gh/tugsbayasgalan/67/head 2025-12-04T08:54:02.2839396Z * [new branch] gh/tugsbayasgalan/67/orig -> origin/gh/tugsbayasgalan/67/orig 2025-12-04T08:54:02.2839476Z * [new branch] gh/tugsbayasgalan/68/base -> origin/gh/tugsbayasgalan/68/base 2025-12-04T08:54:02.2839558Z * [new branch] gh/tugsbayasgalan/68/head -> origin/gh/tugsbayasgalan/68/head 2025-12-04T08:54:02.2839640Z * [new branch] gh/tugsbayasgalan/68/orig -> origin/gh/tugsbayasgalan/68/orig 2025-12-04T08:54:02.2839721Z * [new branch] gh/tugsbayasgalan/7/base -> origin/gh/tugsbayasgalan/7/base 2025-12-04T08:54:02.2839802Z * [new branch] gh/tugsbayasgalan/7/head -> origin/gh/tugsbayasgalan/7/head 2025-12-04T08:54:02.2839880Z * [new branch] gh/tugsbayasgalan/7/orig -> origin/gh/tugsbayasgalan/7/orig 2025-12-04T08:54:02.2839964Z * [new branch] gh/tugsbayasgalan/70/base -> origin/gh/tugsbayasgalan/70/base 2025-12-04T08:54:02.2840048Z * [new branch] gh/tugsbayasgalan/70/head -> origin/gh/tugsbayasgalan/70/head 2025-12-04T08:54:02.2840128Z * [new branch] gh/tugsbayasgalan/70/orig -> origin/gh/tugsbayasgalan/70/orig 2025-12-04T08:54:02.2840208Z * [new branch] gh/tugsbayasgalan/71/base -> origin/gh/tugsbayasgalan/71/base 2025-12-04T08:54:02.2840289Z * [new branch] gh/tugsbayasgalan/71/head -> origin/gh/tugsbayasgalan/71/head 2025-12-04T08:54:02.2840368Z * [new branch] gh/tugsbayasgalan/71/orig -> origin/gh/tugsbayasgalan/71/orig 2025-12-04T08:54:02.2840451Z * [new branch] gh/tugsbayasgalan/72/base -> origin/gh/tugsbayasgalan/72/base 2025-12-04T08:54:02.2840533Z * [new branch] gh/tugsbayasgalan/72/head -> origin/gh/tugsbayasgalan/72/head 2025-12-04T08:54:02.2840613Z * [new branch] gh/tugsbayasgalan/72/orig -> origin/gh/tugsbayasgalan/72/orig 2025-12-04T08:54:02.2840694Z * [new branch] gh/tugsbayasgalan/73/base -> origin/gh/tugsbayasgalan/73/base 2025-12-04T08:54:02.2840776Z * [new branch] gh/tugsbayasgalan/73/head -> origin/gh/tugsbayasgalan/73/head 2025-12-04T08:54:02.2840856Z * [new branch] 
gh/tugsbayasgalan/73/orig -> origin/gh/tugsbayasgalan/73/orig 2025-12-04T08:54:02.2840938Z * [new branch] gh/tugsbayasgalan/74/base -> origin/gh/tugsbayasgalan/74/base 2025-12-04T08:54:02.2841018Z * [new branch] gh/tugsbayasgalan/74/head -> origin/gh/tugsbayasgalan/74/head 2025-12-04T08:54:02.2841126Z * [new branch] gh/tugsbayasgalan/74/orig -> origin/gh/tugsbayasgalan/74/orig 2025-12-04T08:54:02.2841207Z * [new branch] gh/tugsbayasgalan/75/base -> origin/gh/tugsbayasgalan/75/base 2025-12-04T08:54:02.2841287Z * [new branch] gh/tugsbayasgalan/75/head -> origin/gh/tugsbayasgalan/75/head 2025-12-04T08:54:02.2841368Z * [new branch] gh/tugsbayasgalan/75/orig -> origin/gh/tugsbayasgalan/75/orig 2025-12-04T08:54:02.2841450Z * [new branch] gh/tugsbayasgalan/76/base -> origin/gh/tugsbayasgalan/76/base 2025-12-04T08:54:02.2841530Z * [new branch] gh/tugsbayasgalan/76/head -> origin/gh/tugsbayasgalan/76/head 2025-12-04T08:54:02.2841611Z * [new branch] gh/tugsbayasgalan/76/orig -> origin/gh/tugsbayasgalan/76/orig 2025-12-04T08:54:02.2841693Z * [new branch] gh/tugsbayasgalan/77/base -> origin/gh/tugsbayasgalan/77/base 2025-12-04T08:54:02.2841774Z * [new branch] gh/tugsbayasgalan/77/head -> origin/gh/tugsbayasgalan/77/head 2025-12-04T08:54:02.2841855Z * [new branch] gh/tugsbayasgalan/77/orig -> origin/gh/tugsbayasgalan/77/orig 2025-12-04T08:54:02.2841937Z * [new branch] gh/tugsbayasgalan/78/base -> origin/gh/tugsbayasgalan/78/base 2025-12-04T08:54:02.2842019Z * [new branch] gh/tugsbayasgalan/78/head -> origin/gh/tugsbayasgalan/78/head 2025-12-04T08:54:02.2842131Z * [new branch] gh/tugsbayasgalan/78/orig -> origin/gh/tugsbayasgalan/78/orig 2025-12-04T08:54:02.2842214Z * [new branch] gh/tugsbayasgalan/79/base -> origin/gh/tugsbayasgalan/79/base 2025-12-04T08:54:02.2842294Z * [new branch] gh/tugsbayasgalan/79/head -> origin/gh/tugsbayasgalan/79/head 2025-12-04T08:54:02.2842375Z * [new branch] gh/tugsbayasgalan/79/orig -> origin/gh/tugsbayasgalan/79/orig 2025-12-04T08:54:02.2842455Z 
* [new branch] gh/tugsbayasgalan/8/base -> origin/gh/tugsbayasgalan/8/base 2025-12-04T08:54:02.2842536Z * [new branch] gh/tugsbayasgalan/8/head -> origin/gh/tugsbayasgalan/8/head 2025-12-04T08:54:02.2842615Z * [new branch] gh/tugsbayasgalan/8/orig -> origin/gh/tugsbayasgalan/8/orig 2025-12-04T08:54:02.2842695Z * [new branch] gh/tugsbayasgalan/80/base -> origin/gh/tugsbayasgalan/80/base 2025-12-04T08:54:02.2842778Z * [new branch] gh/tugsbayasgalan/80/head -> origin/gh/tugsbayasgalan/80/head 2025-12-04T08:54:02.2842861Z * [new branch] gh/tugsbayasgalan/80/orig -> origin/gh/tugsbayasgalan/80/orig 2025-12-04T08:54:02.2842941Z * [new branch] gh/tugsbayasgalan/81/base -> origin/gh/tugsbayasgalan/81/base 2025-12-04T08:54:02.2843020Z * [new branch] gh/tugsbayasgalan/81/head -> origin/gh/tugsbayasgalan/81/head 2025-12-04T08:54:02.2843101Z * [new branch] gh/tugsbayasgalan/81/orig -> origin/gh/tugsbayasgalan/81/orig 2025-12-04T08:54:02.2843184Z * [new branch] gh/tugsbayasgalan/82/base -> origin/gh/tugsbayasgalan/82/base 2025-12-04T08:54:02.2843263Z * [new branch] gh/tugsbayasgalan/82/head -> origin/gh/tugsbayasgalan/82/head 2025-12-04T08:54:02.2843345Z * [new branch] gh/tugsbayasgalan/82/orig -> origin/gh/tugsbayasgalan/82/orig 2025-12-04T08:54:02.2843427Z * [new branch] gh/tugsbayasgalan/83/base -> origin/gh/tugsbayasgalan/83/base 2025-12-04T08:54:02.2843508Z * [new branch] gh/tugsbayasgalan/83/head -> origin/gh/tugsbayasgalan/83/head 2025-12-04T08:54:02.2843590Z * [new branch] gh/tugsbayasgalan/83/orig -> origin/gh/tugsbayasgalan/83/orig 2025-12-04T08:54:02.2843671Z * [new branch] gh/tugsbayasgalan/84/base -> origin/gh/tugsbayasgalan/84/base 2025-12-04T08:54:02.2843751Z * [new branch] gh/tugsbayasgalan/84/head -> origin/gh/tugsbayasgalan/84/head 2025-12-04T08:54:02.2843835Z * [new branch] gh/tugsbayasgalan/84/orig -> origin/gh/tugsbayasgalan/84/orig 2025-12-04T08:54:02.2843951Z * [new branch] gh/tugsbayasgalan/85/base -> origin/gh/tugsbayasgalan/85/base 
2025-12-04T08:54:02.2844033Z * [new branch] gh/tugsbayasgalan/85/head -> origin/gh/tugsbayasgalan/85/head 2025-12-04T08:54:02.2844114Z * [new branch] gh/tugsbayasgalan/85/orig -> origin/gh/tugsbayasgalan/85/orig 2025-12-04T08:54:02.2844195Z * [new branch] gh/tugsbayasgalan/86/base -> origin/gh/tugsbayasgalan/86/base 2025-12-04T08:54:02.2844276Z * [new branch] gh/tugsbayasgalan/86/head -> origin/gh/tugsbayasgalan/86/head 2025-12-04T08:54:02.2844356Z * [new branch] gh/tugsbayasgalan/86/orig -> origin/gh/tugsbayasgalan/86/orig 2025-12-04T08:54:02.2844436Z * [new branch] gh/tugsbayasgalan/87/base -> origin/gh/tugsbayasgalan/87/base 2025-12-04T08:54:02.2844516Z * [new branch] gh/tugsbayasgalan/87/head -> origin/gh/tugsbayasgalan/87/head 2025-12-04T08:54:02.2844598Z * [new branch] gh/tugsbayasgalan/87/orig -> origin/gh/tugsbayasgalan/87/orig 2025-12-04T08:54:02.2844679Z * [new branch] gh/tugsbayasgalan/88/base -> origin/gh/tugsbayasgalan/88/base 2025-12-04T08:54:02.2844760Z * [new branch] gh/tugsbayasgalan/88/head -> origin/gh/tugsbayasgalan/88/head 2025-12-04T08:54:02.2844867Z * [new branch] gh/tugsbayasgalan/88/orig -> origin/gh/tugsbayasgalan/88/orig 2025-12-04T08:54:02.2844947Z * [new branch] gh/tugsbayasgalan/89/base -> origin/gh/tugsbayasgalan/89/base 2025-12-04T08:54:02.2845029Z * [new branch] gh/tugsbayasgalan/89/head -> origin/gh/tugsbayasgalan/89/head 2025-12-04T08:54:02.2845109Z * [new branch] gh/tugsbayasgalan/89/orig -> origin/gh/tugsbayasgalan/89/orig 2025-12-04T08:54:02.2845189Z * [new branch] gh/tugsbayasgalan/9/base -> origin/gh/tugsbayasgalan/9/base 2025-12-04T08:54:02.2845271Z * [new branch] gh/tugsbayasgalan/9/head -> origin/gh/tugsbayasgalan/9/head 2025-12-04T08:54:02.2845349Z * [new branch] gh/tugsbayasgalan/9/orig -> origin/gh/tugsbayasgalan/9/orig 2025-12-04T08:54:02.2845430Z * [new branch] gh/tugsbayasgalan/90/base -> origin/gh/tugsbayasgalan/90/base 2025-12-04T08:54:02.2845512Z * [new branch] gh/tugsbayasgalan/90/head -> 
origin/gh/tugsbayasgalan/90/head 2025-12-04T08:54:02.2845594Z * [new branch] gh/tugsbayasgalan/90/orig -> origin/gh/tugsbayasgalan/90/orig 2025-12-04T08:54:02.2845676Z * [new branch] gh/tugsbayasgalan/91/base -> origin/gh/tugsbayasgalan/91/base 2025-12-04T08:54:02.2845757Z * [new branch] gh/tugsbayasgalan/91/head -> origin/gh/tugsbayasgalan/91/head 2025-12-04T08:54:02.2845838Z * [new branch] gh/tugsbayasgalan/91/orig -> origin/gh/tugsbayasgalan/91/orig 2025-12-04T08:54:02.2845919Z * [new branch] gh/tugsbayasgalan/92/base -> origin/gh/tugsbayasgalan/92/base 2025-12-04T08:54:02.2846032Z * [new branch] gh/tugsbayasgalan/92/head -> origin/gh/tugsbayasgalan/92/head 2025-12-04T08:54:02.2846111Z * [new branch] gh/tugsbayasgalan/92/orig -> origin/gh/tugsbayasgalan/92/orig 2025-12-04T08:54:02.2846192Z * [new branch] gh/tugsbayasgalan/93/base -> origin/gh/tugsbayasgalan/93/base 2025-12-04T08:54:02.2846274Z * [new branch] gh/tugsbayasgalan/93/head -> origin/gh/tugsbayasgalan/93/head 2025-12-04T08:54:02.2846355Z * [new branch] gh/tugsbayasgalan/93/orig -> origin/gh/tugsbayasgalan/93/orig 2025-12-04T08:54:02.2846423Z * [new branch] gh/v0i0/14/base -> origin/gh/v0i0/14/base 2025-12-04T08:54:02.2846486Z * [new branch] gh/v0i0/14/head -> origin/gh/v0i0/14/head 2025-12-04T08:54:02.2846550Z * [new branch] gh/v0i0/14/orig -> origin/gh/v0i0/14/orig 2025-12-04T08:54:02.2846614Z * [new branch] gh/v0i0/15/base -> origin/gh/v0i0/15/base 2025-12-04T08:54:02.2846726Z * [new branch] gh/v0i0/15/head -> origin/gh/v0i0/15/head 2025-12-04T08:54:02.2846787Z * [new branch] gh/v0i0/15/orig -> origin/gh/v0i0/15/orig 2025-12-04T08:54:02.2846851Z * [new branch] gh/v0i0/16/base -> origin/gh/v0i0/16/base 2025-12-04T08:54:02.2846914Z * [new branch] gh/v0i0/16/head -> origin/gh/v0i0/16/head 2025-12-04T08:54:02.2846975Z * [new branch] gh/v0i0/16/orig -> origin/gh/v0i0/16/orig 2025-12-04T08:54:02.2847037Z * [new branch] gh/v0i0/17/base -> origin/gh/v0i0/17/base 2025-12-04T08:54:02.2847098Z * [new branch] 
gh/v0i0/17/head -> origin/gh/v0i0/17/head 2025-12-04T08:54:02.2847160Z * [new branch] gh/v0i0/17/orig -> origin/gh/v0i0/17/orig 2025-12-04T08:54:02.2847222Z * [new branch] gh/v0i0/18/base -> origin/gh/v0i0/18/base 2025-12-04T08:54:02.2847285Z * [new branch] gh/v0i0/18/head -> origin/gh/v0i0/18/head 2025-12-04T08:54:02.2847348Z * [new branch] gh/v0i0/18/orig -> origin/gh/v0i0/18/orig 2025-12-04T08:54:02.2847409Z * [new branch] gh/v0i0/19/base -> origin/gh/v0i0/19/base 2025-12-04T08:54:02.2847508Z * [new branch] gh/v0i0/19/head -> origin/gh/v0i0/19/head 2025-12-04T08:54:02.2847572Z * [new branch] gh/v0i0/19/orig -> origin/gh/v0i0/19/orig 2025-12-04T08:54:02.2847651Z * [new branch] gh/vishal9-team/1/base -> origin/gh/vishal9-team/1/base 2025-12-04T08:54:02.2847726Z * [new branch] gh/vishal9-team/1/head -> origin/gh/vishal9-team/1/head 2025-12-04T08:54:02.2847801Z * [new branch] gh/vishal9-team/2/base -> origin/gh/vishal9-team/2/base 2025-12-04T08:54:02.2847873Z * [new branch] gh/vishal9-team/2/head -> origin/gh/vishal9-team/2/head 2025-12-04T08:54:02.2847946Z * [new branch] gh/vishal9-team/2/orig -> origin/gh/vishal9-team/2/orig 2025-12-04T08:54:02.2848019Z * [new branch] gh/vishal9-team/3/base -> origin/gh/vishal9-team/3/base 2025-12-04T08:54:02.2848091Z * [new branch] gh/vishal9-team/3/head -> origin/gh/vishal9-team/3/head 2025-12-04T08:54:02.2848165Z * [new branch] gh/vishal9-team/3/orig -> origin/gh/vishal9-team/3/orig 2025-12-04T08:54:02.2848238Z * [new branch] gh/vishal9-team/4/base -> origin/gh/vishal9-team/4/base 2025-12-04T08:54:02.2848310Z * [new branch] gh/vishal9-team/4/head -> origin/gh/vishal9-team/4/head 2025-12-04T08:54:02.2848382Z * [new branch] gh/vishal9-team/4/orig -> origin/gh/vishal9-team/4/orig 2025-12-04T08:54:02.2848447Z * [new branch] gh/vkuzo/1/next -> origin/gh/vkuzo/1/next 2025-12-04T08:54:02.2848512Z * [new branch] gh/vkuzo/2/next -> origin/gh/vkuzo/2/next 2025-12-04T08:54:02.2848577Z * [new branch] gh/vkuzo/3/next -> 
origin/gh/vkuzo/3/next 2025-12-04T08:54:02.2848651Z * [new branch] gh/wconstab/424/base -> origin/gh/wconstab/424/base 2025-12-04T08:54:02.2848723Z * [new branch] gh/wconstab/424/head -> origin/gh/wconstab/424/head 2025-12-04T08:54:02.2848794Z * [new branch] gh/wconstab/424/orig -> origin/gh/wconstab/424/orig 2025-12-04T08:54:02.2848866Z * [new branch] gh/wconstab/435/base -> origin/gh/wconstab/435/base 2025-12-04T08:54:02.2848935Z * [new branch] gh/wconstab/435/head -> origin/gh/wconstab/435/head 2025-12-04T08:54:02.2849005Z * [new branch] gh/wconstab/435/orig -> origin/gh/wconstab/435/orig 2025-12-04T08:54:02.2849074Z * [new branch] gh/wconstab/444/base -> origin/gh/wconstab/444/base 2025-12-04T08:54:02.2849143Z * [new branch] gh/wconstab/444/head -> origin/gh/wconstab/444/head 2025-12-04T08:54:02.2849241Z * [new branch] gh/wconstab/444/orig -> origin/gh/wconstab/444/orig 2025-12-04T08:54:02.2849309Z * [new branch] gh/wconstab/447/base -> origin/gh/wconstab/447/base 2025-12-04T08:54:02.2849379Z * [new branch] gh/wconstab/447/head -> origin/gh/wconstab/447/head 2025-12-04T08:54:02.2849450Z * [new branch] gh/wconstab/447/orig -> origin/gh/wconstab/447/orig 2025-12-04T08:54:02.2849519Z * [new branch] gh/wconstab/448/base -> origin/gh/wconstab/448/base 2025-12-04T08:54:02.2849587Z * [new branch] gh/wconstab/448/head -> origin/gh/wconstab/448/head 2025-12-04T08:54:02.2849658Z * [new branch] gh/wconstab/448/orig -> origin/gh/wconstab/448/orig 2025-12-04T08:54:02.2849728Z * [new branch] gh/wconstab/449/base -> origin/gh/wconstab/449/base 2025-12-04T08:54:02.2849797Z * [new branch] gh/wconstab/449/head -> origin/gh/wconstab/449/head 2025-12-04T08:54:02.2849869Z * [new branch] gh/wconstab/449/orig -> origin/gh/wconstab/449/orig 2025-12-04T08:54:02.2849938Z * [new branch] gh/wconstab/450/base -> origin/gh/wconstab/450/base 2025-12-04T08:54:02.2850006Z * [new branch] gh/wconstab/450/head -> origin/gh/wconstab/450/head 2025-12-04T08:54:02.2850102Z * [new branch] 
gh/wconstab/450/orig -> origin/gh/wconstab/450/orig 2025-12-04T08:54:02.2850171Z * [new branch] gh/wconstab/451/base -> origin/gh/wconstab/451/base 2025-12-04T08:54:02.2850240Z * [new branch] gh/wconstab/451/head -> origin/gh/wconstab/451/head 2025-12-04T08:54:02.2850310Z * [new branch] gh/wconstab/451/orig -> origin/gh/wconstab/451/orig 2025-12-04T08:54:02.2850379Z * [new branch] gh/wconstab/452/base -> origin/gh/wconstab/452/base 2025-12-04T08:54:02.2850449Z * [new branch] gh/wconstab/452/head -> origin/gh/wconstab/452/head 2025-12-04T08:54:02.2850519Z * [new branch] gh/wconstab/452/orig -> origin/gh/wconstab/452/orig 2025-12-04T08:54:02.2850588Z * [new branch] gh/wconstab/453/base -> origin/gh/wconstab/453/base 2025-12-04T08:54:02.2850658Z * [new branch] gh/wconstab/453/head -> origin/gh/wconstab/453/head 2025-12-04T08:54:02.2850728Z * [new branch] gh/wconstab/453/orig -> origin/gh/wconstab/453/orig 2025-12-04T08:54:02.2850797Z * [new branch] gh/wconstab/454/base -> origin/gh/wconstab/454/base 2025-12-04T08:54:02.2850867Z * [new branch] gh/wconstab/454/head -> origin/gh/wconstab/454/head 2025-12-04T08:54:02.2850936Z * [new branch] gh/wconstab/454/orig -> origin/gh/wconstab/454/orig 2025-12-04T08:54:02.2851004Z * [new branch] gh/wconstab/455/base -> origin/gh/wconstab/455/base 2025-12-04T08:54:02.2851075Z * [new branch] gh/wconstab/455/head -> origin/gh/wconstab/455/head 2025-12-04T08:54:02.2851146Z * [new branch] gh/wconstab/455/orig -> origin/gh/wconstab/455/orig 2025-12-04T08:54:02.2851216Z * [new branch] gh/wconstab/456/base -> origin/gh/wconstab/456/base 2025-12-04T08:54:02.2851285Z * [new branch] gh/wconstab/456/head -> origin/gh/wconstab/456/head 2025-12-04T08:54:02.2851355Z * [new branch] gh/wconstab/456/orig -> origin/gh/wconstab/456/orig 2025-12-04T08:54:02.2851423Z * [new branch] gh/wconstab/457/base -> origin/gh/wconstab/457/base 2025-12-04T08:54:02.2851493Z * [new branch] gh/wconstab/457/head -> origin/gh/wconstab/457/head 
2025-12-04T08:54:02.2851563Z * [new branch] gh/wconstab/457/orig -> origin/gh/wconstab/457/orig 2025-12-04T08:54:02.2851631Z * [new branch] gh/wconstab/458/base -> origin/gh/wconstab/458/base 2025-12-04T08:54:02.2851730Z * [new branch] gh/wconstab/458/head -> origin/gh/wconstab/458/head 2025-12-04T08:54:02.2851799Z * [new branch] gh/wconstab/458/orig -> origin/gh/wconstab/458/orig 2025-12-04T08:54:02.2851868Z * [new branch] gh/wconstab/459/base -> origin/gh/wconstab/459/base 2025-12-04T08:54:02.2851938Z * [new branch] gh/wconstab/459/head -> origin/gh/wconstab/459/head 2025-12-04T08:54:02.2852007Z * [new branch] gh/wconstab/459/orig -> origin/gh/wconstab/459/orig 2025-12-04T08:54:02.2852078Z * [new branch] gh/wconstab/460/base -> origin/gh/wconstab/460/base 2025-12-04T08:54:02.2852147Z * [new branch] gh/wconstab/460/head -> origin/gh/wconstab/460/head 2025-12-04T08:54:02.2852216Z * [new branch] gh/wconstab/460/orig -> origin/gh/wconstab/460/orig 2025-12-04T08:54:02.2852285Z * [new branch] gh/wconstab/461/base -> origin/gh/wconstab/461/base 2025-12-04T08:54:02.2852355Z * [new branch] gh/wconstab/461/head -> origin/gh/wconstab/461/head 2025-12-04T08:54:02.2852423Z * [new branch] gh/wconstab/461/orig -> origin/gh/wconstab/461/orig 2025-12-04T08:54:02.2852494Z * [new branch] gh/wconstab/462/base -> origin/gh/wconstab/462/base 2025-12-04T08:54:02.2852563Z * [new branch] gh/wconstab/462/head -> origin/gh/wconstab/462/head 2025-12-04T08:54:02.2852668Z * [new branch] gh/wconstab/462/orig -> origin/gh/wconstab/462/orig 2025-12-04T08:54:02.2852740Z * [new branch] gh/wconstab/463/base -> origin/gh/wconstab/463/base 2025-12-04T08:54:02.2852809Z * [new branch] gh/wconstab/463/head -> origin/gh/wconstab/463/head 2025-12-04T08:54:02.2852878Z * [new branch] gh/wconstab/463/orig -> origin/gh/wconstab/463/orig 2025-12-04T08:54:02.2852948Z * [new branch] gh/wconstab/464/base -> origin/gh/wconstab/464/base 2025-12-04T08:54:02.2853019Z * [new branch] gh/wconstab/464/head -> 
origin/gh/wconstab/464/head 2025-12-04T08:54:02.2853087Z * [new branch] gh/wconstab/464/orig -> origin/gh/wconstab/464/orig 2025-12-04T08:54:02.2853157Z * [new branch] gh/wconstab/465/base -> origin/gh/wconstab/465/base 2025-12-04T08:54:02.2853226Z * [new branch] gh/wconstab/465/head -> origin/gh/wconstab/465/head 2025-12-04T08:54:02.2853296Z * [new branch] gh/wconstab/465/orig -> origin/gh/wconstab/465/orig 2025-12-04T08:54:02.2853367Z * [new branch] gh/wconstab/466/base -> origin/gh/wconstab/466/base 2025-12-04T08:54:02.2853436Z * [new branch] gh/wconstab/466/head -> origin/gh/wconstab/466/head 2025-12-04T08:54:02.2853506Z * [new branch] gh/wconstab/466/orig -> origin/gh/wconstab/466/orig 2025-12-04T08:54:02.2853575Z * [new branch] gh/wconstab/467/base -> origin/gh/wconstab/467/base 2025-12-04T08:54:02.2853646Z * [new branch] gh/wconstab/467/head -> origin/gh/wconstab/467/head 2025-12-04T08:54:02.2853717Z * [new branch] gh/wconstab/467/orig -> origin/gh/wconstab/467/orig 2025-12-04T08:54:02.2853786Z * [new branch] gh/wconstab/468/base -> origin/gh/wconstab/468/base 2025-12-04T08:54:02.2853854Z * [new branch] gh/wconstab/468/head -> origin/gh/wconstab/468/head 2025-12-04T08:54:02.2853926Z * [new branch] gh/wconstab/468/orig -> origin/gh/wconstab/468/orig 2025-12-04T08:54:02.2853998Z * [new branch] gh/weifengpy/39/base -> origin/gh/weifengpy/39/base 2025-12-04T08:54:02.2854069Z * [new branch] gh/weifengpy/39/head -> origin/gh/weifengpy/39/head 2025-12-04T08:54:02.2854142Z * [new branch] gh/weifengpy/39/orig -> origin/gh/weifengpy/39/orig 2025-12-04T08:54:02.2854212Z * [new branch] gh/weifengpy/40/base -> origin/gh/weifengpy/40/base 2025-12-04T08:54:02.2854304Z * [new branch] gh/weifengpy/40/head -> origin/gh/weifengpy/40/head 2025-12-04T08:54:02.2854376Z * [new branch] gh/weifengpy/40/orig -> origin/gh/weifengpy/40/orig 2025-12-04T08:54:02.2854445Z * [new branch] gh/weifengpy/41/base -> origin/gh/weifengpy/41/base 2025-12-04T08:54:02.2854516Z * [new branch] 
gh/weifengpy/41/head -> origin/gh/weifengpy/41/head 2025-12-04T08:54:02.2854588Z * [new branch] gh/weifengpy/41/orig -> origin/gh/weifengpy/41/orig 2025-12-04T08:54:02.2854669Z * [new branch] gh/williamwen42/250/base -> origin/gh/williamwen42/250/base 2025-12-04T08:54:02.2854748Z * [new branch] gh/williamwen42/250/head -> origin/gh/williamwen42/250/head 2025-12-04T08:54:02.2854826Z * [new branch] gh/williamwen42/250/orig -> origin/gh/williamwen42/250/orig 2025-12-04T08:54:02.2854902Z * [new branch] gh/williamwen42/279/base -> origin/gh/williamwen42/279/base 2025-12-04T08:54:02.2854980Z * [new branch] gh/williamwen42/279/head -> origin/gh/williamwen42/279/head 2025-12-04T08:54:02.2855056Z * [new branch] gh/williamwen42/279/orig -> origin/gh/williamwen42/279/orig 2025-12-04T08:54:02.2855131Z * [new branch] gh/williamwen42/282/base -> origin/gh/williamwen42/282/base 2025-12-04T08:54:02.2855235Z * [new branch] gh/williamwen42/282/head -> origin/gh/williamwen42/282/head 2025-12-04T08:54:02.2855311Z * [new branch] gh/williamwen42/282/orig -> origin/gh/williamwen42/282/orig 2025-12-04T08:54:02.2855386Z * [new branch] gh/williamwen42/287/base -> origin/gh/williamwen42/287/base 2025-12-04T08:54:02.2855463Z * [new branch] gh/williamwen42/287/head -> origin/gh/williamwen42/287/head 2025-12-04T08:54:02.2855538Z * [new branch] gh/williamwen42/287/orig -> origin/gh/williamwen42/287/orig 2025-12-04T08:54:02.2855617Z * [new branch] gh/williamwen42/288/base -> origin/gh/williamwen42/288/base 2025-12-04T08:54:02.2855696Z * [new branch] gh/williamwen42/288/head -> origin/gh/williamwen42/288/head 2025-12-04T08:54:02.2855772Z * [new branch] gh/williamwen42/288/orig -> origin/gh/williamwen42/288/orig 2025-12-04T08:54:02.2855848Z * [new branch] gh/williamwen42/296/base -> origin/gh/williamwen42/296/base 2025-12-04T08:54:02.2856030Z * [new branch] gh/williamwen42/296/head -> origin/gh/williamwen42/296/head 2025-12-04T08:54:02.2856106Z * [new branch] gh/williamwen42/296/orig -> 
origin/gh/williamwen42/296/orig 2025-12-04T08:54:02.2856181Z * [new branch] gh/williamwen42/297/base -> origin/gh/williamwen42/297/base 2025-12-04T08:54:02.2856259Z * [new branch] gh/williamwen42/297/head -> origin/gh/williamwen42/297/head 2025-12-04T08:54:02.2856334Z * [new branch] gh/williamwen42/297/orig -> origin/gh/williamwen42/297/orig 2025-12-04T08:54:02.2856413Z * [new branch] gh/williamwen42/306/base -> origin/gh/williamwen42/306/base 2025-12-04T08:54:02.2856489Z * [new branch] gh/williamwen42/306/head -> origin/gh/williamwen42/306/head 2025-12-04T08:54:02.2856565Z * [new branch] gh/williamwen42/306/orig -> origin/gh/williamwen42/306/orig 2025-12-04T08:54:02.2856644Z * [new branch] gh/williamwen42/309/base -> origin/gh/williamwen42/309/base 2025-12-04T08:54:02.2856719Z * [new branch] gh/williamwen42/309/head -> origin/gh/williamwen42/309/head 2025-12-04T08:54:02.2856794Z * [new branch] gh/williamwen42/309/orig -> origin/gh/williamwen42/309/orig 2025-12-04T08:54:02.2856870Z * [new branch] gh/williamwen42/310/base -> origin/gh/williamwen42/310/base 2025-12-04T08:54:02.2856946Z * [new branch] gh/williamwen42/310/head -> origin/gh/williamwen42/310/head 2025-12-04T08:54:02.2857067Z * [new branch] gh/williamwen42/310/orig -> origin/gh/williamwen42/310/orig 2025-12-04T08:54:02.2857144Z * [new branch] gh/williamwen42/311/base -> origin/gh/williamwen42/311/base 2025-12-04T08:54:02.2857220Z * [new branch] gh/williamwen42/311/head -> origin/gh/williamwen42/311/head 2025-12-04T08:54:02.2857296Z * [new branch] gh/williamwen42/311/orig -> origin/gh/williamwen42/311/orig 2025-12-04T08:54:02.2857374Z * [new branch] gh/williamwen42/319/base -> origin/gh/williamwen42/319/base 2025-12-04T08:54:02.2857449Z * [new branch] gh/williamwen42/319/head -> origin/gh/williamwen42/319/head 2025-12-04T08:54:02.2857525Z * [new branch] gh/williamwen42/319/orig -> origin/gh/williamwen42/319/orig 2025-12-04T08:54:02.2857602Z * [new branch] gh/williamwen42/325/base -> 
origin/gh/williamwen42/325/base 2025-12-04T08:54:02.2857677Z * [new branch] gh/williamwen42/325/head -> origin/gh/williamwen42/325/head 2025-12-04T08:54:02.2857755Z * [new branch] gh/williamwen42/325/orig -> origin/gh/williamwen42/325/orig 2025-12-04T08:54:02.2857832Z * [new branch] gh/williamwen42/326/base -> origin/gh/williamwen42/326/base 2025-12-04T08:54:02.2857907Z * [new branch] gh/williamwen42/326/head -> origin/gh/williamwen42/326/head 2025-12-04T08:54:02.2858034Z * [new branch] gh/williamwen42/326/orig -> origin/gh/williamwen42/326/orig 2025-12-04T08:54:02.2858110Z * [new branch] gh/williamwen42/327/base -> origin/gh/williamwen42/327/base 2025-12-04T08:54:02.2858185Z * [new branch] gh/williamwen42/327/head -> origin/gh/williamwen42/327/head 2025-12-04T08:54:02.2858261Z * [new branch] gh/williamwen42/327/orig -> origin/gh/williamwen42/327/orig 2025-12-04T08:54:02.2858336Z * [new branch] gh/williamwen42/328/base -> origin/gh/williamwen42/328/base 2025-12-04T08:54:02.2858414Z * [new branch] gh/williamwen42/328/head -> origin/gh/williamwen42/328/head 2025-12-04T08:54:02.2858490Z * [new branch] gh/williamwen42/328/orig -> origin/gh/williamwen42/328/orig 2025-12-04T08:54:02.2858565Z * [new branch] gh/williamwen42/329/base -> origin/gh/williamwen42/329/base 2025-12-04T08:54:02.2858641Z * [new branch] gh/williamwen42/329/head -> origin/gh/williamwen42/329/head 2025-12-04T08:54:02.2858719Z * [new branch] gh/williamwen42/329/orig -> origin/gh/williamwen42/329/orig 2025-12-04T08:54:02.2858795Z * [new branch] gh/williamwen42/330/base -> origin/gh/williamwen42/330/base 2025-12-04T08:54:02.2858871Z * [new branch] gh/williamwen42/330/head -> origin/gh/williamwen42/330/head 2025-12-04T08:54:02.2858948Z * [new branch] gh/williamwen42/330/orig -> origin/gh/williamwen42/330/orig 2025-12-04T08:54:02.2859023Z * [new branch] gh/williamwen42/331/base -> origin/gh/williamwen42/331/base 2025-12-04T08:54:02.2859100Z * [new branch] gh/williamwen42/331/head -> 
origin/gh/williamwen42/331/head 2025-12-04T08:54:02.2859177Z * [new branch] gh/williamwen42/331/orig -> origin/gh/williamwen42/331/orig 2025-12-04T08:54:02.2859252Z * [new branch] gh/williamwen42/332/base -> origin/gh/williamwen42/332/base 2025-12-04T08:54:02.2859330Z * [new branch] gh/williamwen42/332/head -> origin/gh/williamwen42/332/head 2025-12-04T08:54:02.2859407Z * [new branch] gh/williamwen42/332/orig -> origin/gh/williamwen42/332/orig 2025-12-04T08:54:02.2859483Z * [new branch] gh/williamwen42/333/base -> origin/gh/williamwen42/333/base 2025-12-04T08:54:02.2859559Z * [new branch] gh/williamwen42/333/head -> origin/gh/williamwen42/333/head 2025-12-04T08:54:02.2859635Z * [new branch] gh/williamwen42/333/orig -> origin/gh/williamwen42/333/orig 2025-12-04T08:54:02.2859710Z * [new branch] gh/williamwen42/334/base -> origin/gh/williamwen42/334/base 2025-12-04T08:54:02.2859812Z * [new branch] gh/williamwen42/334/head -> origin/gh/williamwen42/334/head 2025-12-04T08:54:02.2859888Z * [new branch] gh/williamwen42/334/orig -> origin/gh/williamwen42/334/orig 2025-12-04T08:54:02.2859963Z * [new branch] gh/williamwen42/335/base -> origin/gh/williamwen42/335/base 2025-12-04T08:54:02.2860040Z * [new branch] gh/williamwen42/335/head -> origin/gh/williamwen42/335/head 2025-12-04T08:54:02.2860117Z * [new branch] gh/williamwen42/335/orig -> origin/gh/williamwen42/335/orig 2025-12-04T08:54:02.2860193Z * [new branch] gh/williamwen42/336/base -> origin/gh/williamwen42/336/base 2025-12-04T08:54:02.2860270Z * [new branch] gh/williamwen42/336/head -> origin/gh/williamwen42/336/head 2025-12-04T08:54:02.2860345Z * [new branch] gh/williamwen42/336/orig -> origin/gh/williamwen42/336/orig 2025-12-04T08:54:02.2860424Z * [new branch] gh/williamwen42/337/base -> origin/gh/williamwen42/337/base 2025-12-04T08:54:02.2860501Z * [new branch] gh/williamwen42/337/head -> origin/gh/williamwen42/337/head 2025-12-04T08:54:02.2860576Z * [new branch] gh/williamwen42/337/orig -> 
origin/gh/williamwen42/337/orig 2025-12-04T08:54:02.2860677Z * [new branch] gh/williamwen42/338/base -> origin/gh/williamwen42/338/base 2025-12-04T08:54:02.2860754Z * [new branch] gh/williamwen42/338/head -> origin/gh/williamwen42/338/head 2025-12-04T08:54:02.2860830Z * [new branch] gh/williamwen42/338/orig -> origin/gh/williamwen42/338/orig 2025-12-04T08:54:02.2860908Z * [new branch] gh/williamwen42/339/base -> origin/gh/williamwen42/339/base 2025-12-04T08:54:02.2860983Z * [new branch] gh/williamwen42/339/head -> origin/gh/williamwen42/339/head 2025-12-04T08:54:02.2861058Z * [new branch] gh/williamwen42/339/orig -> origin/gh/williamwen42/339/orig 2025-12-04T08:54:02.2861136Z * [new branch] gh/williamwen42/340/base -> origin/gh/williamwen42/340/base 2025-12-04T08:54:02.2861211Z * [new branch] gh/williamwen42/340/head -> origin/gh/williamwen42/340/head 2025-12-04T08:54:02.2861286Z * [new branch] gh/williamwen42/340/orig -> origin/gh/williamwen42/340/orig 2025-12-04T08:54:02.2861364Z * [new branch] gh/williamwen42/341/base -> origin/gh/williamwen42/341/base 2025-12-04T08:54:02.2861440Z * [new branch] gh/williamwen42/341/head -> origin/gh/williamwen42/341/head 2025-12-04T08:54:02.2861515Z * [new branch] gh/williamwen42/341/orig -> origin/gh/williamwen42/341/orig 2025-12-04T08:54:02.2861592Z * [new branch] gh/williamwen42/342/base -> origin/gh/williamwen42/342/base 2025-12-04T08:54:02.2861668Z * [new branch] gh/williamwen42/342/head -> origin/gh/williamwen42/342/head 2025-12-04T08:54:02.2861748Z * [new branch] gh/williamwen42/342/orig -> origin/gh/williamwen42/342/orig 2025-12-04T08:54:02.2861826Z * [new branch] gh/williamwen42/343/base -> origin/gh/williamwen42/343/base 2025-12-04T08:54:02.2861902Z * [new branch] gh/williamwen42/343/head -> origin/gh/williamwen42/343/head 2025-12-04T08:54:02.2861980Z * [new branch] gh/williamwen42/343/orig -> origin/gh/williamwen42/343/orig 2025-12-04T08:54:02.2862058Z * [new branch] gh/williamwen42/344/base -> 
origin/gh/williamwen42/344/base 2025-12-04T08:54:02.2862133Z * [new branch] gh/williamwen42/344/head -> origin/gh/williamwen42/344/head 2025-12-04T08:54:02.2862209Z * [new branch] gh/williamwen42/344/orig -> origin/gh/williamwen42/344/orig 2025-12-04T08:54:02.2862286Z * [new branch] gh/williamwen42/345/base -> origin/gh/williamwen42/345/base 2025-12-04T08:54:02.2862361Z * [new branch] gh/williamwen42/345/head -> origin/gh/williamwen42/345/head 2025-12-04T08:54:02.2862463Z * [new branch] gh/williamwen42/345/orig -> origin/gh/williamwen42/345/orig 2025-12-04T08:54:02.2862538Z * [new branch] gh/williamwen42/346/base -> origin/gh/williamwen42/346/base 2025-12-04T08:54:02.2862613Z * [new branch] gh/williamwen42/346/head -> origin/gh/williamwen42/346/head 2025-12-04T08:54:02.2862691Z * [new branch] gh/williamwen42/346/orig -> origin/gh/williamwen42/346/orig 2025-12-04T08:54:02.2862767Z * [new branch] gh/williamwen42/347/base -> origin/gh/williamwen42/347/base 2025-12-04T08:54:02.2862842Z * [new branch] gh/williamwen42/347/head -> origin/gh/williamwen42/347/head 2025-12-04T08:54:02.2862919Z * [new branch] gh/williamwen42/347/orig -> origin/gh/williamwen42/347/orig 2025-12-04T08:54:02.2862995Z * [new branch] gh/williamwen42/348/base -> origin/gh/williamwen42/348/base 2025-12-04T08:54:02.2863072Z * [new branch] gh/williamwen42/348/head -> origin/gh/williamwen42/348/head 2025-12-04T08:54:02.2863150Z * [new branch] gh/williamwen42/348/orig -> origin/gh/williamwen42/348/orig 2025-12-04T08:54:02.2863226Z * [new branch] gh/williamwen42/349/base -> origin/gh/williamwen42/349/base 2025-12-04T08:54:02.2863339Z * [new branch] gh/williamwen42/349/head -> origin/gh/williamwen42/349/head 2025-12-04T08:54:02.2863416Z * [new branch] gh/williamwen42/349/orig -> origin/gh/williamwen42/349/orig 2025-12-04T08:54:02.2863492Z * [new branch] gh/williamwen42/350/base -> origin/gh/williamwen42/350/base 2025-12-04T08:54:02.2863566Z * [new branch] gh/williamwen42/350/head -> 
origin/gh/williamwen42/350/head 2025-12-04T08:54:02.2863644Z * [new branch] gh/williamwen42/350/orig -> origin/gh/williamwen42/350/orig 2025-12-04T08:54:02.2863719Z * [new branch] gh/williamwen42/351/base -> origin/gh/williamwen42/351/base 2025-12-04T08:54:02.2863798Z * [new branch] gh/williamwen42/351/head -> origin/gh/williamwen42/351/head 2025-12-04T08:54:02.2863873Z * [new branch] gh/williamwen42/351/orig -> origin/gh/williamwen42/351/orig 2025-12-04T08:54:02.2863947Z * [new branch] gh/williamwen42/352/base -> origin/gh/williamwen42/352/base 2025-12-04T08:54:02.2864025Z * [new branch] gh/williamwen42/352/head -> origin/gh/williamwen42/352/head 2025-12-04T08:54:02.2864100Z * [new branch] gh/williamwen42/352/orig -> origin/gh/williamwen42/352/orig 2025-12-04T08:54:02.2864175Z * [new branch] gh/williamwen42/353/base -> origin/gh/williamwen42/353/base 2025-12-04T08:54:02.2864251Z * [new branch] gh/williamwen42/353/head -> origin/gh/williamwen42/353/head 2025-12-04T08:54:02.2864326Z * [new branch] gh/williamwen42/353/orig -> origin/gh/williamwen42/353/orig 2025-12-04T08:54:02.2864403Z * [new branch] gh/williamwen42/354/base -> origin/gh/williamwen42/354/base 2025-12-04T08:54:02.2864480Z * [new branch] gh/williamwen42/354/head -> origin/gh/williamwen42/354/head 2025-12-04T08:54:02.2864555Z * [new branch] gh/williamwen42/354/orig -> origin/gh/williamwen42/354/orig 2025-12-04T08:54:02.2864632Z * [new branch] gh/williamwen42/355/base -> origin/gh/williamwen42/355/base 2025-12-04T08:54:02.2864710Z * [new branch] gh/williamwen42/355/head -> origin/gh/williamwen42/355/head 2025-12-04T08:54:02.2864785Z * [new branch] gh/williamwen42/355/orig -> origin/gh/williamwen42/355/orig 2025-12-04T08:54:02.2864860Z * [new branch] gh/williamwen42/356/base -> origin/gh/williamwen42/356/base 2025-12-04T08:54:02.2864936Z * [new branch] gh/williamwen42/356/head -> origin/gh/williamwen42/356/head 2025-12-04T08:54:02.2865011Z * [new branch] gh/williamwen42/356/orig -> 
origin/gh/williamwen42/356/orig 2025-12-04T08:54:02.2865112Z * [new branch] gh/williamwen42/357/base -> origin/gh/williamwen42/357/base 2025-12-04T08:54:02.2865189Z * [new branch] gh/williamwen42/357/head -> origin/gh/williamwen42/357/head 2025-12-04T08:54:02.2865264Z * [new branch] gh/williamwen42/357/orig -> origin/gh/williamwen42/357/orig 2025-12-04T08:54:02.2865342Z * [new branch] gh/williamwen42/358/base -> origin/gh/williamwen42/358/base 2025-12-04T08:54:02.2865417Z * [new branch] gh/williamwen42/358/head -> origin/gh/williamwen42/358/head 2025-12-04T08:54:02.2865492Z * [new branch] gh/williamwen42/358/orig -> origin/gh/williamwen42/358/orig 2025-12-04T08:54:02.2865563Z * [new branch] gh/xmfan/169/base -> origin/gh/xmfan/169/base 2025-12-04T08:54:02.2865629Z * [new branch] gh/xmfan/169/head -> origin/gh/xmfan/169/head 2025-12-04T08:54:02.2865697Z * [new branch] gh/xmfan/170/base -> origin/gh/xmfan/170/base 2025-12-04T08:54:02.2865763Z * [new branch] gh/xmfan/170/head -> origin/gh/xmfan/170/head 2025-12-04T08:54:02.2865828Z * [new branch] gh/xmfan/274/base -> origin/gh/xmfan/274/base 2025-12-04T08:54:02.2865892Z * [new branch] gh/xmfan/274/head -> origin/gh/xmfan/274/head 2025-12-04T08:54:02.2866073Z * [new branch] gh/xmfan/274/orig -> origin/gh/xmfan/274/orig 2025-12-04T08:54:02.2866138Z * [new branch] gh/xmfan/277/base -> origin/gh/xmfan/277/base 2025-12-04T08:54:02.2866202Z * [new branch] gh/xmfan/277/head -> origin/gh/xmfan/277/head 2025-12-04T08:54:02.2866268Z * [new branch] gh/xmfan/277/orig -> origin/gh/xmfan/277/orig 2025-12-04T08:54:02.2866332Z * [new branch] gh/xmfan/301/base -> origin/gh/xmfan/301/base 2025-12-04T08:54:02.2866395Z * [new branch] gh/xmfan/301/head -> origin/gh/xmfan/301/head 2025-12-04T08:54:02.2866462Z * [new branch] gh/xmfan/301/orig -> origin/gh/xmfan/301/orig 2025-12-04T08:54:02.2866526Z * [new branch] gh/xmfan/304/base -> origin/gh/xmfan/304/base 2025-12-04T08:54:02.2866590Z * [new branch] gh/xmfan/304/head -> 
origin/gh/xmfan/304/head 2025-12-04T08:54:02.2866657Z * [new branch] gh/xmfan/304/orig -> origin/gh/xmfan/304/orig 2025-12-04T08:54:02.2866721Z * [new branch] gh/xmfan/309/base -> origin/gh/xmfan/309/base 2025-12-04T08:54:02.2866787Z * [new branch] gh/xmfan/309/head -> origin/gh/xmfan/309/head 2025-12-04T08:54:02.2866851Z * [new branch] gh/xmfan/309/orig -> origin/gh/xmfan/309/orig 2025-12-04T08:54:02.2866915Z * [new branch] gh/xmfan/310/base -> origin/gh/xmfan/310/base 2025-12-04T08:54:02.2866980Z * [new branch] gh/xmfan/310/head -> origin/gh/xmfan/310/head 2025-12-04T08:54:02.2867045Z * [new branch] gh/xmfan/310/orig -> origin/gh/xmfan/310/orig 2025-12-04T08:54:02.2867110Z * [new branch] gh/xmfan/311/base -> origin/gh/xmfan/311/base 2025-12-04T08:54:02.2867174Z * [new branch] gh/xmfan/311/head -> origin/gh/xmfan/311/head 2025-12-04T08:54:02.2867241Z * [new branch] gh/xmfan/311/orig -> origin/gh/xmfan/311/orig 2025-12-04T08:54:02.2867305Z * [new branch] gh/xmfan/312/base -> origin/gh/xmfan/312/base 2025-12-04T08:54:02.2867370Z * [new branch] gh/xmfan/312/head -> origin/gh/xmfan/312/head 2025-12-04T08:54:02.2867434Z * [new branch] gh/xmfan/312/orig -> origin/gh/xmfan/312/orig 2025-12-04T08:54:02.2867498Z * [new branch] gh/xmfan/313/base -> origin/gh/xmfan/313/base 2025-12-04T08:54:02.2867563Z * [new branch] gh/xmfan/313/head -> origin/gh/xmfan/313/head 2025-12-04T08:54:02.2867676Z * [new branch] gh/xmfan/313/orig -> origin/gh/xmfan/313/orig 2025-12-04T08:54:02.2867753Z * [new branch] gh/xuanzhang816/27/base -> origin/gh/xuanzhang816/27/base 2025-12-04T08:54:02.2867829Z * [new branch] gh/xuanzhang816/27/head -> origin/gh/xuanzhang816/27/head 2025-12-04T08:54:02.2867904Z * [new branch] gh/xuanzhang816/27/orig -> origin/gh/xuanzhang816/27/orig 2025-12-04T08:54:02.2867978Z * [new branch] gh/xuanzhang816/32/base -> origin/gh/xuanzhang816/32/base 2025-12-04T08:54:02.2868052Z * [new branch] gh/xuanzhang816/32/head -> origin/gh/xuanzhang816/32/head 
2025-12-04T08:54:02.2868125Z * [new branch] gh/xuanzhang816/32/orig -> origin/gh/xuanzhang816/32/orig 2025-12-04T08:54:02.2868198Z * [new branch] gh/xuanzhang816/33/base -> origin/gh/xuanzhang816/33/base 2025-12-04T08:54:02.2868273Z * [new branch] gh/xuanzhang816/33/head -> origin/gh/xuanzhang816/33/head 2025-12-04T08:54:02.2868347Z * [new branch] gh/xuanzhang816/33/orig -> origin/gh/xuanzhang816/33/orig 2025-12-04T08:54:02.2868420Z * [new branch] gh/xuanzhang816/34/base -> origin/gh/xuanzhang816/34/base 2025-12-04T08:54:02.2868493Z * [new branch] gh/xuanzhang816/34/head -> origin/gh/xuanzhang816/34/head 2025-12-04T08:54:02.2868589Z * [new branch] gh/xuanzhang816/34/orig -> origin/gh/xuanzhang816/34/orig 2025-12-04T08:54:02.2868664Z * [new branch] gh/xuanzhang816/35/base -> origin/gh/xuanzhang816/35/base 2025-12-04T08:54:02.2868737Z * [new branch] gh/xuanzhang816/35/head -> origin/gh/xuanzhang816/35/head 2025-12-04T08:54:02.2868810Z * [new branch] gh/xuanzhang816/35/orig -> origin/gh/xuanzhang816/35/orig 2025-12-04T08:54:02.2868882Z * [new branch] gh/yanbing-j/11/base -> origin/gh/yanbing-j/11/base 2025-12-04T08:54:02.2868955Z * [new branch] gh/yanbing-j/11/head -> origin/gh/yanbing-j/11/head 2025-12-04T08:54:02.2869025Z * [new branch] gh/yanbing-j/11/orig -> origin/gh/yanbing-j/11/orig 2025-12-04T08:54:02.2869095Z * [new branch] gh/yanbing-j/12/base -> origin/gh/yanbing-j/12/base 2025-12-04T08:54:02.2869164Z * [new branch] gh/yanbing-j/12/head -> origin/gh/yanbing-j/12/head 2025-12-04T08:54:02.2869234Z * [new branch] gh/yanbing-j/12/orig -> origin/gh/yanbing-j/12/orig 2025-12-04T08:54:02.2869303Z * [new branch] gh/yanbing-j/13/base -> origin/gh/yanbing-j/13/base 2025-12-04T08:54:02.2869371Z * [new branch] gh/yanbing-j/13/head -> origin/gh/yanbing-j/13/head 2025-12-04T08:54:02.2869438Z * [new branch] gh/yanbing-j/13/orig -> origin/gh/yanbing-j/13/orig 2025-12-04T08:54:02.2869508Z * [new branch] gh/yanbing-j/14/base -> origin/gh/yanbing-j/14/base 
2025-12-04T08:54:02.2869578Z * [new branch] gh/yanbing-j/14/head -> origin/gh/yanbing-j/14/head 2025-12-04T08:54:02.2869647Z * [new branch] gh/yanbing-j/14/orig -> origin/gh/yanbing-j/14/orig 2025-12-04T08:54:02.2869717Z * [new branch] gh/yanbing-j/15/base -> origin/gh/yanbing-j/15/base 2025-12-04T08:54:02.2869785Z * [new branch] gh/yanbing-j/15/head -> origin/gh/yanbing-j/15/head 2025-12-04T08:54:02.2869854Z * [new branch] gh/yanbing-j/15/orig -> origin/gh/yanbing-j/15/orig 2025-12-04T08:54:02.2869924Z * [new branch] gh/yanbing-j/18/base -> origin/gh/yanbing-j/18/base 2025-12-04T08:54:02.2869991Z * [new branch] gh/yanbing-j/18/head -> origin/gh/yanbing-j/18/head 2025-12-04T08:54:02.2870061Z * [new branch] gh/yanbing-j/18/orig -> origin/gh/yanbing-j/18/orig 2025-12-04T08:54:02.2870128Z * [new branch] gh/yanbing-j/19/base -> origin/gh/yanbing-j/19/base 2025-12-04T08:54:02.2870255Z * [new branch] gh/yanbing-j/19/head -> origin/gh/yanbing-j/19/head 2025-12-04T08:54:02.2870324Z * [new branch] gh/yanbing-j/19/orig -> origin/gh/yanbing-j/19/orig 2025-12-04T08:54:02.2870392Z * [new branch] gh/yanbing-j/20/base -> origin/gh/yanbing-j/20/base 2025-12-04T08:54:02.2870460Z * [new branch] gh/yanbing-j/20/head -> origin/gh/yanbing-j/20/head 2025-12-04T08:54:02.2870531Z * [new branch] gh/yanbing-j/20/orig -> origin/gh/yanbing-j/20/orig 2025-12-04T08:54:02.2870599Z * [new branch] gh/yanbing-j/21/base -> origin/gh/yanbing-j/21/base 2025-12-04T08:54:02.2870667Z * [new branch] gh/yanbing-j/21/head -> origin/gh/yanbing-j/21/head 2025-12-04T08:54:02.2870736Z * [new branch] gh/yanbing-j/22/base -> origin/gh/yanbing-j/22/base 2025-12-04T08:54:02.2870804Z * [new branch] gh/yanbing-j/22/head -> origin/gh/yanbing-j/22/head 2025-12-04T08:54:02.2870875Z * [new branch] gh/yanbing-j/22/orig -> origin/gh/yanbing-j/22/orig 2025-12-04T08:54:02.2870944Z * [new branch] gh/yanbing-j/23/base -> origin/gh/yanbing-j/23/base 2025-12-04T08:54:02.2871012Z * [new branch] gh/yanbing-j/23/head -> 
origin/gh/yanbing-j/23/head 2025-12-04T08:54:02.2871108Z * [new branch] gh/yanbing-j/23/orig -> origin/gh/yanbing-j/23/orig 2025-12-04T08:54:02.2871178Z * [new branch] gh/yanbing-j/24/base -> origin/gh/yanbing-j/24/base 2025-12-04T08:54:02.2871245Z * [new branch] gh/yanbing-j/24/head -> origin/gh/yanbing-j/24/head 2025-12-04T08:54:02.2871314Z * [new branch] gh/yanbing-j/24/orig -> origin/gh/yanbing-j/24/orig 2025-12-04T08:54:02.2871383Z * [new branch] gh/yanbing-j/25/base -> origin/gh/yanbing-j/25/base 2025-12-04T08:54:02.2871452Z * [new branch] gh/yanbing-j/25/head -> origin/gh/yanbing-j/25/head 2025-12-04T08:54:02.2871522Z * [new branch] gh/yanbing-j/25/orig -> origin/gh/yanbing-j/25/orig 2025-12-04T08:54:02.2871591Z * [new branch] gh/yanbing-j/26/base -> origin/gh/yanbing-j/26/base 2025-12-04T08:54:02.2871659Z * [new branch] gh/yanbing-j/26/head -> origin/gh/yanbing-j/26/head 2025-12-04T08:54:02.2871729Z * [new branch] gh/yanbing-j/26/orig -> origin/gh/yanbing-j/26/orig 2025-12-04T08:54:02.2871807Z * [new branch] gh/yang-yu-hang/1/base -> origin/gh/yang-yu-hang/1/base 2025-12-04T08:54:02.2871882Z * [new branch] gh/yang-yu-hang/1/head -> origin/gh/yang-yu-hang/1/head 2025-12-04T08:54:02.2871956Z * [new branch] gh/yang-yu-hang/1/orig -> origin/gh/yang-yu-hang/1/orig 2025-12-04T08:54:02.2872028Z * [new branch] gh/yang-yu-hang/2/base -> origin/gh/yang-yu-hang/2/base 2025-12-04T08:54:02.2872100Z * [new branch] gh/yang-yu-hang/2/head -> origin/gh/yang-yu-hang/2/head 2025-12-04T08:54:02.2872173Z * [new branch] gh/yang-yu-hang/2/orig -> origin/gh/yang-yu-hang/2/orig 2025-12-04T08:54:02.2872244Z * [new branch] gh/yang-yu-hang/3/base -> origin/gh/yang-yu-hang/3/base 2025-12-04T08:54:02.2872316Z * [new branch] gh/yang-yu-hang/3/head -> origin/gh/yang-yu-hang/3/head 2025-12-04T08:54:02.2872389Z * [new branch] gh/yang-yu-hang/3/orig -> origin/gh/yang-yu-hang/3/orig 2025-12-04T08:54:02.2872460Z * [new branch] gh/yangw-dev/12/base -> origin/gh/yangw-dev/12/base 
2025-12-04T08:54:02.2872530Z * [new branch] gh/yangw-dev/12/head -> origin/gh/yangw-dev/12/head 2025-12-04T08:54:02.2872601Z * [new branch] gh/yangw-dev/12/orig -> origin/gh/yangw-dev/12/orig 2025-12-04T08:54:02.2872670Z * [new branch] gh/yangw-dev/13/base -> origin/gh/yangw-dev/13/base 2025-12-04T08:54:02.2872739Z * [new branch] gh/yangw-dev/13/head -> origin/gh/yangw-dev/13/head 2025-12-04T08:54:02.2872847Z * [new branch] gh/yangw-dev/13/orig -> origin/gh/yangw-dev/13/orig 2025-12-04T08:54:02.2872915Z * [new branch] gh/yangw-dev/14/base -> origin/gh/yangw-dev/14/base 2025-12-04T08:54:02.2872985Z * [new branch] gh/yangw-dev/14/head -> origin/gh/yangw-dev/14/head 2025-12-04T08:54:02.2873055Z * [new branch] gh/yangw-dev/14/orig -> origin/gh/yangw-dev/14/orig 2025-12-04T08:54:02.2873123Z * [new branch] gh/yangw-dev/15/base -> origin/gh/yangw-dev/15/base 2025-12-04T08:54:02.2873194Z * [new branch] gh/yangw-dev/15/head -> origin/gh/yangw-dev/15/head 2025-12-04T08:54:02.2873311Z * [new branch] gh/yangw-dev/15/orig -> origin/gh/yangw-dev/15/orig 2025-12-04T08:54:02.2873379Z * [new branch] gh/yangw-dev/19/base -> origin/gh/yangw-dev/19/base 2025-12-04T08:54:02.2873450Z * [new branch] gh/yangw-dev/19/head -> origin/gh/yangw-dev/19/head 2025-12-04T08:54:02.2873521Z * [new branch] gh/yangw-dev/19/orig -> origin/gh/yangw-dev/19/orig 2025-12-04T08:54:02.2873590Z * [new branch] gh/yangw-dev/26/base -> origin/gh/yangw-dev/26/base 2025-12-04T08:54:02.2873659Z * [new branch] gh/yangw-dev/26/head -> origin/gh/yangw-dev/26/head 2025-12-04T08:54:02.2873766Z * [new branch] gh/yangw-dev/26/orig -> origin/gh/yangw-dev/26/orig 2025-12-04T08:54:02.2873834Z * [new branch] gh/yangw-dev/27/base -> origin/gh/yangw-dev/27/base 2025-12-04T08:54:02.2873903Z * [new branch] gh/yangw-dev/27/head -> origin/gh/yangw-dev/27/head 2025-12-04T08:54:02.2873972Z * [new branch] gh/yangw-dev/27/orig -> origin/gh/yangw-dev/27/orig 2025-12-04T08:54:02.2874039Z * [new branch] gh/ydwu4/292/base -> 
origin/gh/ydwu4/292/base 2025-12-04T08:54:02.2874107Z * [new branch] gh/ydwu4/292/head -> origin/gh/ydwu4/292/head 2025-12-04T08:54:02.2874173Z * [new branch] gh/ydwu4/292/orig -> origin/gh/ydwu4/292/orig 2025-12-04T08:54:02.2874237Z * [new branch] gh/ydwu4/294/base -> origin/gh/ydwu4/294/base 2025-12-04T08:54:02.2874303Z * [new branch] gh/ydwu4/294/head -> origin/gh/ydwu4/294/head 2025-12-04T08:54:02.2874368Z * [new branch] gh/ydwu4/294/orig -> origin/gh/ydwu4/294/orig 2025-12-04T08:54:02.2874433Z * [new branch] gh/ydwu4/295/base -> origin/gh/ydwu4/295/base 2025-12-04T08:54:02.2874499Z * [new branch] gh/ydwu4/295/head -> origin/gh/ydwu4/295/head 2025-12-04T08:54:02.2874562Z * [new branch] gh/ydwu4/295/orig -> origin/gh/ydwu4/295/orig 2025-12-04T08:54:02.2874629Z * [new branch] gh/ydwu4/296/base -> origin/gh/ydwu4/296/base 2025-12-04T08:54:02.2874693Z * [new branch] gh/ydwu4/296/head -> origin/gh/ydwu4/296/head 2025-12-04T08:54:02.2874758Z * [new branch] gh/ydwu4/296/orig -> origin/gh/ydwu4/296/orig 2025-12-04T08:54:02.2874824Z * [new branch] gh/ydwu4/306/base -> origin/gh/ydwu4/306/base 2025-12-04T08:54:02.2874888Z * [new branch] gh/ydwu4/306/head -> origin/gh/ydwu4/306/head 2025-12-04T08:54:02.2874953Z * [new branch] gh/ydwu4/306/orig -> origin/gh/ydwu4/306/orig 2025-12-04T08:54:02.2875018Z * [new branch] gh/ydwu4/312/base -> origin/gh/ydwu4/312/base 2025-12-04T08:54:02.2875082Z * [new branch] gh/ydwu4/312/head -> origin/gh/ydwu4/312/head 2025-12-04T08:54:02.2875146Z * [new branch] gh/ydwu4/312/orig -> origin/gh/ydwu4/312/orig 2025-12-04T08:54:02.2875211Z * [new branch] gh/ydwu4/322/base -> origin/gh/ydwu4/322/base 2025-12-04T08:54:02.2875275Z * [new branch] gh/ydwu4/322/head -> origin/gh/ydwu4/322/head 2025-12-04T08:54:02.2875371Z * [new branch] gh/ydwu4/322/orig -> origin/gh/ydwu4/322/orig 2025-12-04T08:54:02.2875436Z * [new branch] gh/ydwu4/327/base -> origin/gh/ydwu4/327/base 2025-12-04T08:54:02.2875500Z * [new branch] gh/ydwu4/327/head -> 
origin/gh/ydwu4/327/head 2025-12-04T08:54:02.2875566Z * [new branch] gh/ydwu4/327/orig -> origin/gh/ydwu4/327/orig 2025-12-04T08:54:02.2875631Z * [new branch] gh/ydwu4/328/base -> origin/gh/ydwu4/328/base 2025-12-04T08:54:02.2875697Z * [new branch] gh/ydwu4/328/head -> origin/gh/ydwu4/328/head 2025-12-04T08:54:02.2875761Z * [new branch] gh/ydwu4/328/orig -> origin/gh/ydwu4/328/orig 2025-12-04T08:54:02.2875826Z * [new branch] gh/ydwu4/329/base -> origin/gh/ydwu4/329/base 2025-12-04T08:54:02.2875890Z * [new branch] gh/ydwu4/329/head -> origin/gh/ydwu4/329/head 2025-12-04T08:54:02.2876000Z * [new branch] gh/ydwu4/329/orig -> origin/gh/ydwu4/329/orig 2025-12-04T08:54:02.2876065Z * [new branch] gh/ydwu4/330/base -> origin/gh/ydwu4/330/base 2025-12-04T08:54:02.2876129Z * [new branch] gh/ydwu4/330/head -> origin/gh/ydwu4/330/head 2025-12-04T08:54:02.2876230Z * [new branch] gh/ydwu4/330/orig -> origin/gh/ydwu4/330/orig 2025-12-04T08:54:02.2876295Z * [new branch] gh/ydwu4/331/base -> origin/gh/ydwu4/331/base 2025-12-04T08:54:02.2876360Z * [new branch] gh/ydwu4/331/head -> origin/gh/ydwu4/331/head 2025-12-04T08:54:02.2876425Z * [new branch] gh/ydwu4/331/orig -> origin/gh/ydwu4/331/orig 2025-12-04T08:54:02.2876489Z * [new branch] gh/ydwu4/332/base -> origin/gh/ydwu4/332/base 2025-12-04T08:54:02.2876553Z * [new branch] gh/ydwu4/332/head -> origin/gh/ydwu4/332/head 2025-12-04T08:54:02.2876619Z * [new branch] gh/ydwu4/332/orig -> origin/gh/ydwu4/332/orig 2025-12-04T08:54:02.2876684Z * [new branch] gh/ydwu4/333/base -> origin/gh/ydwu4/333/base 2025-12-04T08:54:02.2876747Z * [new branch] gh/ydwu4/333/head -> origin/gh/ydwu4/333/head 2025-12-04T08:54:02.2876812Z * [new branch] gh/ydwu4/333/orig -> origin/gh/ydwu4/333/orig 2025-12-04T08:54:02.2876877Z * [new branch] gh/ydwu4/334/base -> origin/gh/ydwu4/334/base 2025-12-04T08:54:02.2876942Z * [new branch] gh/ydwu4/334/head -> origin/gh/ydwu4/334/head 2025-12-04T08:54:02.2877006Z * [new branch] gh/ydwu4/334/orig -> 
origin/gh/ydwu4/334/orig 2025-12-04T08:54:02.2877070Z * [new branch] gh/ydwu4/335/base -> origin/gh/ydwu4/335/base 2025-12-04T08:54:02.2877134Z * [new branch] gh/ydwu4/335/head -> origin/gh/ydwu4/335/head 2025-12-04T08:54:02.2877201Z * [new branch] gh/ydwu4/335/orig -> origin/gh/ydwu4/335/orig 2025-12-04T08:54:02.2877265Z * [new branch] gh/ydwu4/337/base -> origin/gh/ydwu4/337/base 2025-12-04T08:54:02.2877330Z * [new branch] gh/ydwu4/337/head -> origin/gh/ydwu4/337/head 2025-12-04T08:54:02.2877395Z * [new branch] gh/ydwu4/337/orig -> origin/gh/ydwu4/337/orig 2025-12-04T08:54:02.2877460Z * [new branch] gh/ydwu4/339/base -> origin/gh/ydwu4/339/base 2025-12-04T08:54:02.2877524Z * [new branch] gh/ydwu4/339/head -> origin/gh/ydwu4/339/head 2025-12-04T08:54:02.2877588Z * [new branch] gh/ydwu4/339/orig -> origin/gh/ydwu4/339/orig 2025-12-04T08:54:02.2877651Z * [new branch] gh/yf225/133/base -> origin/gh/yf225/133/base 2025-12-04T08:54:02.2877715Z * [new branch] gh/yf225/133/head -> origin/gh/yf225/133/head 2025-12-04T08:54:02.2877823Z * [new branch] gh/yf225/93/base -> origin/gh/yf225/93/base 2025-12-04T08:54:02.2877888Z * [new branch] gh/yf225/93/head -> origin/gh/yf225/93/head 2025-12-04T08:54:02.2877960Z * [new branch] gh/yifuwang/152/base -> origin/gh/yifuwang/152/base 2025-12-04T08:54:02.2878031Z * [new branch] gh/yifuwang/152/head -> origin/gh/yifuwang/152/head 2025-12-04T08:54:02.2878102Z * [new branch] gh/yifuwang/152/orig -> origin/gh/yifuwang/152/orig 2025-12-04T08:54:02.2878174Z * [new branch] gh/yifuwang/195/base -> origin/gh/yifuwang/195/base 2025-12-04T08:54:02.2878243Z * [new branch] gh/yifuwang/195/head -> origin/gh/yifuwang/195/head 2025-12-04T08:54:02.2878311Z * [new branch] gh/yifuwang/195/orig -> origin/gh/yifuwang/195/orig 2025-12-04T08:54:02.2878384Z * [new branch] gh/yiming0416/1/base -> origin/gh/yiming0416/1/base 2025-12-04T08:54:02.2878456Z * [new branch] gh/yiming0416/1/head -> origin/gh/yiming0416/1/head 2025-12-04T08:54:02.2878524Z * [new 
branch] gh/yiming0416/2/base -> origin/gh/yiming0416/2/base 2025-12-04T08:54:02.2878594Z * [new branch] gh/yiming0416/2/head -> origin/gh/yiming0416/2/head 2025-12-04T08:54:02.2878666Z * [new branch] gh/yushangdi/1/base -> origin/gh/yushangdi/1/base 2025-12-04T08:54:02.2878764Z * [new branch] gh/yushangdi/1/head -> origin/gh/yushangdi/1/head 2025-12-04T08:54:02.2878836Z * [new branch] gh/yushangdi/10/base -> origin/gh/yushangdi/10/base 2025-12-04T08:54:02.2878907Z * [new branch] gh/yushangdi/10/head -> origin/gh/yushangdi/10/head 2025-12-04T08:54:02.2878977Z * [new branch] gh/yushangdi/10/orig -> origin/gh/yushangdi/10/orig 2025-12-04T08:54:02.2879048Z * [new branch] gh/yushangdi/11/base -> origin/gh/yushangdi/11/base 2025-12-04T08:54:02.2879119Z * [new branch] gh/yushangdi/11/head -> origin/gh/yushangdi/11/head 2025-12-04T08:54:02.2879189Z * [new branch] gh/yushangdi/11/orig -> origin/gh/yushangdi/11/orig 2025-12-04T08:54:02.2879260Z * [new branch] gh/yushangdi/2/base -> origin/gh/yushangdi/2/base 2025-12-04T08:54:02.2879331Z * [new branch] gh/yushangdi/2/head -> origin/gh/yushangdi/2/head 2025-12-04T08:54:02.2879402Z * [new branch] gh/yushangdi/7/base -> origin/gh/yushangdi/7/base 2025-12-04T08:54:02.2879472Z * [new branch] gh/yushangdi/7/head -> origin/gh/yushangdi/7/head 2025-12-04T08:54:02.2879540Z * [new branch] gh/yushangdi/7/orig -> origin/gh/yushangdi/7/orig 2025-12-04T08:54:02.2879610Z * [new branch] gh/yushangdi/8/base -> origin/gh/yushangdi/8/base 2025-12-04T08:54:02.2879678Z * [new branch] gh/yushangdi/8/head -> origin/gh/yushangdi/8/head 2025-12-04T08:54:02.2879749Z * [new branch] gh/yushangdi/8/orig -> origin/gh/yushangdi/8/orig 2025-12-04T08:54:02.2879818Z * [new branch] gh/yushangdi/9/base -> origin/gh/yushangdi/9/base 2025-12-04T08:54:02.2879887Z * [new branch] gh/yushangdi/9/head -> origin/gh/yushangdi/9/head 2025-12-04T08:54:02.2879955Z * [new branch] gh/yushangdi/9/orig -> origin/gh/yushangdi/9/orig 2025-12-04T08:54:02.2880024Z * [new branch] 
gh/zklaus/19/base -> origin/gh/zklaus/19/base 2025-12-04T08:54:02.2880091Z * [new branch] gh/zklaus/19/head -> origin/gh/zklaus/19/head 2025-12-04T08:54:02.2880156Z * [new branch] gh/zklaus/19/orig -> origin/gh/zklaus/19/orig 2025-12-04T08:54:02.2880223Z * [new branch] gh/zklaus/20/base -> origin/gh/zklaus/20/base 2025-12-04T08:54:02.2880288Z * [new branch] gh/zklaus/20/head -> origin/gh/zklaus/20/head 2025-12-04T08:54:02.2880380Z * [new branch] gh/zklaus/20/orig -> origin/gh/zklaus/20/orig 2025-12-04T08:54:02.2880447Z * [new branch] gh/zklaus/21/base -> origin/gh/zklaus/21/base 2025-12-04T08:54:02.2880512Z * [new branch] gh/zklaus/21/head -> origin/gh/zklaus/21/head 2025-12-04T08:54:02.2880576Z * [new branch] gh/zklaus/21/orig -> origin/gh/zklaus/21/orig 2025-12-04T08:54:02.2880642Z * [new branch] gh/zklaus/22/base -> origin/gh/zklaus/22/base 2025-12-04T08:54:02.2880707Z * [new branch] gh/zklaus/22/head -> origin/gh/zklaus/22/head 2025-12-04T08:54:02.2880772Z * [new branch] gh/zklaus/22/orig -> origin/gh/zklaus/22/orig 2025-12-04T08:54:02.2880838Z * [new branch] gh/zklaus/23/base -> origin/gh/zklaus/23/base 2025-12-04T08:54:02.2880903Z * [new branch] gh/zklaus/23/head -> origin/gh/zklaus/23/head 2025-12-04T08:54:02.2880969Z * [new branch] gh/zklaus/23/orig -> origin/gh/zklaus/23/orig 2025-12-04T08:54:02.2881034Z * [new branch] gh/zklaus/24/base -> origin/gh/zklaus/24/base 2025-12-04T08:54:02.2881099Z * [new branch] gh/zklaus/24/head -> origin/gh/zklaus/24/head 2025-12-04T08:54:02.2881165Z * [new branch] gh/zklaus/24/orig -> origin/gh/zklaus/24/orig 2025-12-04T08:54:02.2881266Z * [new branch] gh/zou3519/1197/base -> origin/gh/zou3519/1197/base 2025-12-04T08:54:02.2881336Z * [new branch] gh/zou3519/1197/head -> origin/gh/zou3519/1197/head 2025-12-04T08:54:02.2881406Z * [new branch] gh/zou3519/1197/orig -> origin/gh/zou3519/1197/orig 2025-12-04T08:54:02.2881473Z * [new branch] gh/zou3519/1199/base -> origin/gh/zou3519/1199/base 2025-12-04T08:54:02.2881540Z * [new 
branch] gh/zou3519/1199/head -> origin/gh/zou3519/1199/head 2025-12-04T08:54:02.2881610Z * [new branch] gh/zou3519/1199/orig -> origin/gh/zou3519/1199/orig 2025-12-04T08:54:02.2881677Z * [new branch] gh/zou3519/1200/base -> origin/gh/zou3519/1200/base 2025-12-04T08:54:02.2881743Z * [new branch] gh/zou3519/1200/head -> origin/gh/zou3519/1200/head 2025-12-04T08:54:02.2881810Z * [new branch] gh/zou3519/1200/orig -> origin/gh/zou3519/1200/orig 2025-12-04T08:54:02.2881879Z * [new branch] gh/zou3519/1201/base -> origin/gh/zou3519/1201/base 2025-12-04T08:54:02.2881947Z * [new branch] gh/zou3519/1201/head -> origin/gh/zou3519/1201/head 2025-12-04T08:54:02.2882016Z * [new branch] gh/zou3519/1201/orig -> origin/gh/zou3519/1201/orig 2025-12-04T08:54:02.2882082Z * [new branch] gh/zou3519/1202/base -> origin/gh/zou3519/1202/base 2025-12-04T08:54:02.2882149Z * [new branch] gh/zou3519/1202/head -> origin/gh/zou3519/1202/head 2025-12-04T08:54:02.2882218Z * [new branch] gh/zou3519/1202/orig -> origin/gh/zou3519/1202/orig 2025-12-04T08:54:02.2882286Z * [new branch] gh/zpcore/1/base -> origin/gh/zpcore/1/base 2025-12-04T08:54:02.2882353Z * [new branch] gh/zpcore/1/head -> origin/gh/zpcore/1/head 2025-12-04T08:54:02.2882421Z * [new branch] gh/zpcore/11/base -> origin/gh/zpcore/11/base 2025-12-04T08:54:02.2882488Z * [new branch] gh/zpcore/11/head -> origin/gh/zpcore/11/head 2025-12-04T08:54:02.2882554Z * [new branch] gh/zpcore/11/orig -> origin/gh/zpcore/11/orig 2025-12-04T08:54:02.2882619Z * [new branch] gh/zpcore/12/base -> origin/gh/zpcore/12/base 2025-12-04T08:54:02.2882684Z * [new branch] gh/zpcore/12/head -> origin/gh/zpcore/12/head 2025-12-04T08:54:02.2882750Z * [new branch] gh/zpcore/12/orig -> origin/gh/zpcore/12/orig 2025-12-04T08:54:02.2882855Z * [new branch] gh/zpcore/13/base -> origin/gh/zpcore/13/base 2025-12-04T08:54:02.2882921Z * [new branch] gh/zpcore/13/head -> origin/gh/zpcore/13/head 2025-12-04T08:54:02.2882987Z * [new branch] gh/zpcore/13/orig -> 
origin/gh/zpcore/13/orig 2025-12-04T08:54:02.2883052Z * [new branch] gh/zpcore/14/base -> origin/gh/zpcore/14/base 2025-12-04T08:54:02.2883118Z * [new branch] gh/zpcore/14/head -> origin/gh/zpcore/14/head 2025-12-04T08:54:02.2883185Z * [new branch] gh/zpcore/14/orig -> origin/gh/zpcore/14/orig 2025-12-04T08:54:02.2883250Z * [new branch] gh/zpcore/15/base -> origin/gh/zpcore/15/base 2025-12-04T08:54:02.2883315Z * [new branch] gh/zpcore/15/head -> origin/gh/zpcore/15/head 2025-12-04T08:54:02.2883384Z * [new branch] gh/zpcore/15/orig -> origin/gh/zpcore/15/orig 2025-12-04T08:54:02.2883449Z * [new branch] gh/zpcore/2/base -> origin/gh/zpcore/2/base 2025-12-04T08:54:02.2883517Z * [new branch] gh/zpcore/2/head -> origin/gh/zpcore/2/head 2025-12-04T08:54:02.2883585Z * [new branch] gh/zpcore/21/base -> origin/gh/zpcore/21/base 2025-12-04T08:54:02.2883650Z * [new branch] gh/zpcore/21/head -> origin/gh/zpcore/21/head 2025-12-04T08:54:02.2883742Z * [new branch] gh/zpcore/21/orig -> origin/gh/zpcore/21/orig 2025-12-04T08:54:02.2883809Z * [new branch] gh/zpcore/22/base -> origin/gh/zpcore/22/base 2025-12-04T08:54:02.2883874Z * [new branch] gh/zpcore/22/head -> origin/gh/zpcore/22/head 2025-12-04T08:54:02.2883939Z * [new branch] gh/zpcore/22/orig -> origin/gh/zpcore/22/orig 2025-12-04T08:54:02.2884005Z * [new branch] gh/zpcore/23/base -> origin/gh/zpcore/23/base 2025-12-04T08:54:02.2884070Z * [new branch] gh/zpcore/23/head -> origin/gh/zpcore/23/head 2025-12-04T08:54:02.2884139Z * [new branch] gh/zpcore/23/orig -> origin/gh/zpcore/23/orig 2025-12-04T08:54:02.2884203Z * [new branch] gh/zpcore/24/base -> origin/gh/zpcore/24/base 2025-12-04T08:54:02.2884268Z * [new branch] gh/zpcore/24/head -> origin/gh/zpcore/24/head 2025-12-04T08:54:02.2884336Z * [new branch] gh/zpcore/24/orig -> origin/gh/zpcore/24/orig 2025-12-04T08:54:02.2884401Z * [new branch] gh/zpcore/25/base -> origin/gh/zpcore/25/base 2025-12-04T08:54:02.2884465Z * [new branch] gh/zpcore/25/head -> 
origin/gh/zpcore/25/head 2025-12-04T08:54:02.2884531Z * [new branch] gh/zpcore/25/orig -> origin/gh/zpcore/25/orig 2025-12-04T08:54:02.2884596Z * [new branch] gh/zpcore/26/base -> origin/gh/zpcore/26/base 2025-12-04T08:54:02.2884660Z * [new branch] gh/zpcore/26/head -> origin/gh/zpcore/26/head 2025-12-04T08:54:02.2884727Z * [new branch] gh/zpcore/26/orig -> origin/gh/zpcore/26/orig 2025-12-04T08:54:02.2884792Z * [new branch] gh/zpcore/27/base -> origin/gh/zpcore/27/base 2025-12-04T08:54:02.2884857Z * [new branch] gh/zpcore/27/head -> origin/gh/zpcore/27/head 2025-12-04T08:54:02.2884924Z * [new branch] gh/zpcore/27/orig -> origin/gh/zpcore/27/orig 2025-12-04T08:54:02.2884989Z * [new branch] gh/zpcore/28/base -> origin/gh/zpcore/28/base 2025-12-04T08:54:02.2885054Z * [new branch] gh/zpcore/28/head -> origin/gh/zpcore/28/head 2025-12-04T08:54:02.2885121Z * [new branch] gh/zpcore/28/orig -> origin/gh/zpcore/28/orig 2025-12-04T08:54:02.2885186Z * [new branch] gh/zpcore/3/base -> origin/gh/zpcore/3/base 2025-12-04T08:54:02.2885251Z * [new branch] gh/zpcore/3/head -> origin/gh/zpcore/3/head 2025-12-04T08:54:02.2885346Z * [new branch] gh/zpcore/4/base -> origin/gh/zpcore/4/base 2025-12-04T08:54:02.2885411Z * [new branch] gh/zpcore/4/head -> origin/gh/zpcore/4/head 2025-12-04T08:54:02.2885475Z * [new branch] gh/zpcore/5/base -> origin/gh/zpcore/5/base 2025-12-04T08:54:02.2885541Z * [new branch] gh/zpcore/5/head -> origin/gh/zpcore/5/head 2025-12-04T08:54:02.2885607Z * [new branch] gh/zpcore/6/base -> origin/gh/zpcore/6/base 2025-12-04T08:54:02.2885673Z * [new branch] gh/zpcore/6/head -> origin/gh/zpcore/6/head 2025-12-04T08:54:02.2885737Z * [new branch] gh/zpcore/7/base -> origin/gh/zpcore/7/base 2025-12-04T08:54:02.2885801Z * [new branch] gh/zpcore/7/head -> origin/gh/zpcore/7/head 2025-12-04T08:54:02.2885867Z * [new branch] gh/zpcore/8/base -> origin/gh/zpcore/8/base 2025-12-04T08:54:02.2885971Z * [new branch] gh/zpcore/8/head -> origin/gh/zpcore/8/head 
2025-12-04T08:54:02.2886038Z * [new branch] google-main -> origin/google-main 2025-12-04T08:54:02.2886123Z * [new branch] guangyey/external_stream -> origin/guangyey/external_stream 2025-12-04T08:54:02.2886193Z * [new branch] guangyey/test_2025 -> origin/guangyey/test_2025 2025-12-04T08:54:02.2886382Z * [new branch] guilhermeleobas/cherry-pick-55d87d9dfd9 -> origin/guilhermeleobas/cherry-pick-55d87d9dfd9 2025-12-04T08:54:02.2886499Z * [new branch] hameerabbasi/complex_tensor_subclass -> origin/hameerabbasi/complex_tensor_subclass 2025-12-04T08:54:02.2886636Z * [new branch] hameerabbasi/fix-ctensor-gradcheck-tests -> origin/hameerabbasi/fix-ctensor-gradcheck-tests 2025-12-04T08:54:02.2886742Z * [new branch] hameerabbasi/gradcheck-allclose -> origin/hameerabbasi/gradcheck-allclose 2025-12-04T08:54:02.2886809Z * [new branch] hc_baseline -> origin/hc_baseline 2025-12-04T08:54:02.2886872Z * [new branch] hhh_rand -> origin/hhh_rand 2025-12-04T08:54:02.2886933Z * [new branch] huba/f1 -> origin/huba/f1 2025-12-04T08:54:02.2887119Z * [new branch] increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test -> origin/increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test 2025-12-04T08:54:02.2887181Z * [new branch] inlining -> origin/inlining 2025-12-04T08:54:02.2887250Z * [new branch] inlining-ezyang -> origin/inlining-ezyang 2025-12-04T08:54:02.2887333Z * [new branch] install-torchao-0.13.0 -> origin/install-torchao-0.13.0 2025-12-04T08:54:02.2887507Z * [new branch] instrument-trunk-pull-linux-with-job-test-filters -> origin/instrument-trunk-pull-linux-with-job-test-filters 2025-12-04T08:54:02.2887581Z * [new branch] invoke-subgraph -> origin/invoke-subgraph 2025-12-04T08:54:02.2887645Z * [new branch] issue#58739 -> origin/issue#58739 2025-12-04T08:54:02.2887724Z * [new branch] jainapurva-patch-1 -> origin/jainapurva-patch-1 2025-12-04T08:54:02.2887785Z * [new branch] jathu/o3 -> origin/jathu/o3 2025-12-04T08:54:02.2887846Z * [new branch] jathu/sve -> origin/jathu/sve 
2025-12-04T08:54:02.2887968Z * [new branch] jcaip/test-cusparselt-version-0.6.2 -> origin/jcaip/test-cusparselt-version-0.6.2 2025-12-04T08:54:02.2888072Z * [new branch] jcaip/update-cusparselt-0.6.2 -> origin/jcaip/update-cusparselt-0.6.2 2025-12-04T08:54:02.2888182Z * [new branch] jiannanWang/memorysnapshot_filter -> origin/jiannanWang/memorysnapshot_filter 2025-12-04T08:54:02.2888289Z * [new branch] jiannanWang/profilerstepwarning -> origin/jiannanWang/profilerstepwarning 2025-12-04T08:54:02.2888420Z * [new branch] jithunnair-amd-patch-1 -> origin/jithunnair-amd-patch-1 2025-12-04T08:54:02.2888502Z * [new branch] jithunnair-amd-patch-10 -> origin/jithunnair-amd-patch-10 2025-12-04T08:54:02.2888583Z * [new branch] jithunnair-amd-patch-2 -> origin/jithunnair-amd-patch-2 2025-12-04T08:54:02.2888665Z * [new branch] jithunnair-amd-patch-3 -> origin/jithunnair-amd-patch-3 2025-12-04T08:54:02.2888743Z * [new branch] jithunnair-amd-patch-4 -> origin/jithunnair-amd-patch-4 2025-12-04T08:54:02.2888820Z * [new branch] jithunnair-amd-patch-5 -> origin/jithunnair-amd-patch-5 2025-12-04T08:54:02.2888900Z * [new branch] jithunnair-amd-patch-6 -> origin/jithunnair-amd-patch-6 2025-12-04T08:54:02.2888977Z * [new branch] jithunnair-amd-patch-7 -> origin/jithunnair-amd-patch-7 2025-12-04T08:54:02.2889056Z * [new branch] jithunnair-amd-patch-8 -> origin/jithunnair-amd-patch-8 2025-12-04T08:54:02.2889134Z * [new branch] jithunnair-amd-patch-9 -> origin/jithunnair-amd-patch-9 2025-12-04T08:54:02.2889209Z * [new branch] justinchu/native-qdq -> origin/justinchu/native-qdq 2025-12-04T08:54:02.2889281Z * [new branch] kainan666/xlf_debug -> origin/kainan666/xlf_debug 2025-12-04T08:54:02.2889368Z * [new branch] kainan_test -> origin/kainan_test 2025-12-04T08:54:02.2889444Z * [new branch] larryliu0820-patch-1 -> origin/larryliu0820-patch-1 2025-12-04T08:54:02.2889549Z * [new branch] leslie/test_group_gemm_epilogues -> origin/leslie/test_group_gemm_epilogues 2025-12-04T08:54:02.2889651Z * 
[new branch] lessw2020/fix_cutlass_cache_error -> origin/lessw2020/fix_cutlass_cache_error 2025-12-04T08:54:02.2889728Z * [new branch] liaoxuan/shm_all_reduce -> origin/liaoxuan/shm_all_reduce 2025-12-04T08:54:02.2889831Z * [new branch] liaoxuan/test_fa_disable_softmax -> origin/liaoxuan/test_fa_disable_softmax 2025-12-04T08:54:02.2889910Z * [new branch] liaoxuan/test_int8_sdpa -> origin/liaoxuan/test_int8_sdpa 2025-12-04T08:54:02.2889976Z * [new branch] llama4-stable -> origin/llama4-stable 2025-12-04T08:54:02.2890043Z * [new branch] lts/release/1.8 -> origin/lts/release/1.8 2025-12-04T08:54:02.2890116Z * [new branch] lucaskabela/#94773 -> origin/lucaskabela/#94773 2025-12-04T08:54:02.2890191Z * [new branch] lucaskabela/fix_164876 -> origin/lucaskabela/fix_164876 2025-12-04T08:54:02.2890273Z * [new branch] lucaskabela/flop_counter -> origin/lucaskabela/flop_counter 2025-12-04T08:54:02.2890367Z * [new branch] lucaskabela/func_under_decomp -> origin/lucaskabela/func_under_decomp 2025-12-04T08:54:02.2890470Z * [new branch] lucaskabela/functional_in_dynamo -> origin/lucaskabela/functional_in_dynamo 2025-12-04T08:54:02.2890595Z * [new branch] lucaskabela/install_params_as_graph_attr -> origin/lucaskabela/install_params_as_graph_attr 2025-12-04T08:54:02.2890707Z * [new branch] lucaskabela/parameters_as_graph_attr -> origin/lucaskabela/parameters_as_graph_attr 2025-12-04T08:54:02.2890839Z * [new branch] lucaskabela/remove_aot_dispatcher_metadata -> origin/lucaskabela/remove_aot_dispatcher_metadata 2025-12-04T08:54:02.2890918Z * [new branch] lucaskabela/rnn_decomp -> origin/lucaskabela/rnn_decomp 2025-12-04T08:54:02.2891008Z * [new branch] lucaskabela/typing_backends -> origin/lucaskabela/typing_backends 2025-12-04T08:54:02.2891105Z * [new branch] lucaskabela/typing_ctx_manager -> origin/lucaskabela/typing_ctx_manager 2025-12-04T08:54:02.2891196Z * [new branch] lucaskabela/typing_nn_module -> origin/lucaskabela/typing_nn_module 2025-12-04T08:54:02.2891322Z * [new branch] 
lucaskabela/typing_user_defined -> origin/lucaskabela/typing_user_defined 2025-12-04T08:54:02.2891415Z * [new branch] lucaskabela/typing_variables -> origin/lucaskabela/typing_variables 2025-12-04T08:54:02.2891524Z * [new branch] lucaskabela/typing_variables_dicts -> origin/lucaskabela/typing_variables_dicts 2025-12-04T08:54:02.2891644Z * [new branch] lucaskabela/typing_variables_functions -> origin/lucaskabela/typing_variables_functions 2025-12-04T08:54:02.2891753Z * [new branch] lucaskabela/typing_variables_lists -> origin/lucaskabela/typing_variables_lists 2025-12-04T08:54:02.2891825Z * [new branch] lw/torch_box_by_ref -> origin/lw/torch_box_by_ref 2025-12-04T08:54:02.2891885Z * [new branch] main -> origin/main 2025-12-04T08:54:02.2891955Z * [new branch] malfet-patch-1 -> origin/malfet-patch-1 2025-12-04T08:54:02.2892023Z * [new branch] malfet-patch-2 -> origin/malfet-patch-2 2025-12-04T08:54:02.2892090Z * [new branch] malfet-patch-3 -> origin/malfet-patch-3 2025-12-04T08:54:02.2892155Z * [new branch] malfet-patch-4 -> origin/malfet-patch-4 2025-12-04T08:54:02.2892219Z * [new branch] malfet-patch-5 -> origin/malfet-patch-5 2025-12-04T08:54:02.2892324Z * [new branch] malfet-patch-6 -> origin/malfet-patch-6 2025-12-04T08:54:02.2892388Z * [new branch] malfet-patch-7 -> origin/malfet-patch-7 2025-12-04T08:54:02.2892451Z * [new branch] malfet-patch-8 -> origin/malfet-patch-8 2025-12-04T08:54:02.2892525Z * [new branch] malfet/add-3.14-ci -> origin/malfet/add-3.14-ci 2025-12-04T08:54:02.2892682Z * [new branch] malfet/be-do-not-make-typos-in-build-artifacts -> origin/malfet/be-do-not-make-typos-in-build-artifacts 2025-12-04T08:54:02.2892845Z * [new branch] malfet/be-move-more-settings-to-checkout-pytorch -> origin/malfet/be-move-more-settings-to-checkout-pytorch 2025-12-04T08:54:02.2892972Z * [new branch] malfet/be-remove-misisng-neon-headers -> origin/malfet/be-remove-misisng-neon-headers 2025-12-04T08:54:02.2893067Z * [new branch] malfet/mps-implement-col2im -> 
origin/malfet/mps-implement-col2im 2025-12-04T08:54:02.2893183Z * [new branch] manuel/aoti_metal_shimify-thread_safe -> origin/manuel/aoti_metal_shimify-thread_safe 2025-12-04T08:54:02.2893275Z * [new branch] manuel/inductor_link_openmp -> origin/manuel/inductor_link_openmp 2025-12-04T08:54:02.2893349Z * [new branch] masnesral/metaconda -> origin/masnesral/metaconda 2025-12-04T08:54:02.2893424Z * [new branch] mem_profiler_flaky_fix -> origin/mem_profiler_flaky_fix 2025-12-04T08:54:02.2893503Z * [new branch] mem_profiler_stack_trace -> origin/mem_profiler_stack_trace 2025-12-04T08:54:02.2893579Z * [new branch] memory_profiler_stack -> origin/memory_profiler_stack 2025-12-04T08:54:02.2893651Z * [new branch] metascroy-patch-1 -> origin/metascroy-patch-1 2025-12-04T08:54:02.2893715Z * [new branch] mingw_posix -> origin/mingw_posix 2025-12-04T08:54:02.2893788Z * [new branch] mlazos/S429861-debug -> origin/mlazos/S429861-debug 2025-12-04T08:54:02.2893850Z * [new branch] mlazos/aa -> origin/mlazos/aa 2025-12-04T08:54:02.2893912Z * [new branch] mlazos/acts -> origin/mlazos/acts 2025-12-04T08:54:02.2893983Z * [new branch] mlazos/arg-renames -> origin/mlazos/arg-renames 2025-12-04T08:54:02.2894060Z * [new branch] mlazos/bad-cudagraphs -> origin/mlazos/bad-cudagraphs 2025-12-04T08:54:02.2894158Z * [new branch] mlazos/baseline-graph-breaks -> origin/mlazos/baseline-graph-breaks 2025-12-04T08:54:02.2894257Z * [new branch] mlazos/beta-tensor -> origin/mlazos/beta-tensor 2025-12-04T08:54:02.2894323Z * [new branch] mlazos/buffers -> origin/mlazos/buffers 2025-12-04T08:54:02.2894388Z * [new branch] mlazos/buffers2 -> origin/mlazos/buffers2 2025-12-04T08:54:02.2894454Z * [new branch] mlazos/buffers3 -> origin/mlazos/buffers3 2025-12-04T08:54:02.2894518Z * [new branch] mlazos/bwd -> origin/mlazos/bwd 2025-12-04T08:54:02.2894587Z * [new branch] mlazos/combo-test -> origin/mlazos/combo-test 2025-12-04T08:54:02.2894657Z * [new branch] mlazos/ctx-cleanup -> origin/mlazos/ctx-cleanup 
2025-12-04T08:54:02.2894732Z * [new branch] mlazos/cuda-cmd-log -> origin/mlazos/cuda-cmd-log 2025-12-04T08:54:02.2894811Z * [new branch] mlazos/cudagraph-tests -> origin/mlazos/cudagraph-tests 2025-12-04T08:54:02.2894913Z * [new branch] mlazos/cudagraphs-measurement -> origin/mlazos/cudagraphs-measurement 2025-12-04T08:54:02.2894987Z * [new branch] mlazos/cutlass-test -> origin/mlazos/cutlass-test 2025-12-04T08:54:02.2895067Z * [new branch] mlazos/cutlass-topo-bug -> origin/mlazos/cutlass-topo-bug 2025-12-04T08:54:02.2895173Z * [new branch] mlazos/dataclass-proxy -> origin/mlazos/dataclass-proxy 2025-12-04T08:54:02.2895241Z * [new branch] mlazos/dc-attrs -> origin/mlazos/dc-attrs 2025-12-04T08:54:02.2895309Z * [new branch] mlazos/dc-helion -> origin/mlazos/dc-helion 2025-12-04T08:54:02.2895375Z * [new branch] mlazos/dict-fix -> origin/mlazos/dict-fix 2025-12-04T08:54:02.2895443Z * [new branch] mlazos/disable-tf -> origin/mlazos/disable-tf 2025-12-04T08:54:02.2895508Z * [new branch] mlazos/dupe-fix -> origin/mlazos/dupe-fix 2025-12-04T08:54:02.2895577Z * [new branch] mlazos/dyn-batch -> origin/mlazos/dyn-batch 2025-12-04T08:54:02.2895638Z * [new branch] mlazos/evt -> origin/mlazos/evt 2025-12-04T08:54:02.2895717Z * [new branch] mlazos/extract-examples -> origin/mlazos/extract-examples 2025-12-04T08:54:02.2895788Z * [new branch] mlazos/foreach-op -> origin/mlazos/foreach-op 2025-12-04T08:54:02.2895850Z * [new branch] mlazos/fp8 -> origin/mlazos/fp8 2025-12-04T08:54:02.2895916Z * [new branch] mlazos/fp8-bias -> origin/mlazos/fp8-bias 2025-12-04T08:54:02.2896043Z * [new branch] mlazos/fp8-bias-fusion -> origin/mlazos/fp8-bias-fusion 2025-12-04T08:54:02.2896111Z * [new branch] mlazos/fp8-fixes -> origin/mlazos/fp8-fixes 2025-12-04T08:54:02.2896176Z * [new branch] mlazos/freezing -> origin/mlazos/freezing 2025-12-04T08:54:02.2896244Z * [new branch] mlazos/h-comp -> origin/mlazos/h-comp 2025-12-04T08:54:02.2896309Z * [new branch] mlazos/h-comp2 -> origin/mlazos/h-comp2 
2025-12-04T08:54:02.2896374Z * [new branch] mlazos/hash-hop -> origin/mlazos/hash-hop 2025-12-04T08:54:02.2896435Z * [new branch] mlazos/hc -> origin/mlazos/hc 2025-12-04T08:54:02.2896503Z * [new branch] mlazos/hc-cycles -> origin/mlazos/hc-cycles 2025-12-04T08:54:02.2896567Z * [new branch] mlazos/hc-fixes -> origin/mlazos/hc-fixes 2025-12-04T08:54:02.2896634Z * [new branch] mlazos/hc-fixes3 -> origin/mlazos/hc-fixes3 2025-12-04T08:54:02.2896699Z * [new branch] mlazos/hc-fixes4 -> origin/mlazos/hc-fixes4 2025-12-04T08:54:02.2896762Z * [new branch] mlazos/hc-hf -> origin/mlazos/hc-hf 2025-12-04T08:54:02.2896827Z * [new branch] mlazos/hc-mut -> origin/mlazos/hc-mut 2025-12-04T08:54:02.2896938Z * [new branch] mlazos/hc10 -> origin/mlazos/hc10 2025-12-04T08:54:02.2896998Z * [new branch] mlazos/hc11 -> origin/mlazos/hc11 2025-12-04T08:54:02.2897060Z * [new branch] mlazos/hc12 -> origin/mlazos/hc12 2025-12-04T08:54:02.2897121Z * [new branch] mlazos/hc13 -> origin/mlazos/hc13 2025-12-04T08:54:02.2897180Z * [new branch] mlazos/hc14 -> origin/mlazos/hc14 2025-12-04T08:54:02.2897239Z * [new branch] mlazos/hc15 -> origin/mlazos/hc15 2025-12-04T08:54:02.2897299Z * [new branch] mlazos/hc2 -> origin/mlazos/hc2 2025-12-04T08:54:02.2897361Z * [new branch] mlazos/hc4 -> origin/mlazos/hc4 2025-12-04T08:54:02.2897420Z * [new branch] mlazos/hc5 -> origin/mlazos/hc5 2025-12-04T08:54:02.2897481Z * [new branch] mlazos/hc6 -> origin/mlazos/hc6 2025-12-04T08:54:02.2897541Z * [new branch] mlazos/hc7 -> origin/mlazos/hc7 2025-12-04T08:54:02.2897599Z * [new branch] mlazos/hc8 -> origin/mlazos/hc8 2025-12-04T08:54:02.2897657Z * [new branch] mlazos/hc9 -> origin/mlazos/hc9 2025-12-04T08:54:02.2897769Z * [new branch] mlazos/hc_baseline2 -> origin/mlazos/hc_baseline2 2025-12-04T08:54:02.2897851Z * [new branch] mlazos/inductor-streams -> origin/mlazos/inductor-streams 2025-12-04T08:54:02.2897912Z * [new branch] mlazos/main -> origin/mlazos/main 2025-12-04T08:54:02.2897973Z * [new branch] 
mlazos/mcg2 -> origin/mlazos/mcg2 2025-12-04T08:54:02.2898045Z * [new branch] mlazos/meta-guards -> origin/mlazos/meta-guards 2025-12-04T08:54:02.2898149Z * [new branch] mlazos/mlazos/foreach-map-adam -> origin/mlazos/mlazos/foreach-map-adam 2025-12-04T08:54:02.2898247Z * [new branch] mlazos/mlazos/tf-mode-backup -> origin/mlazos/mlazos/tf-mode-backup 2025-12-04T08:54:02.2898312Z * [new branch] mlazos/mod-fix -> origin/mlazos/mod-fix 2025-12-04T08:54:02.2898378Z * [new branch] mlazos/mode-fix -> origin/mlazos/mode-fix 2025-12-04T08:54:02.2898445Z * [new branch] mlazos/offsets -> origin/mlazos/offsets 2025-12-04T08:54:02.2898517Z * [new branch] mlazos/overguarding -> origin/mlazos/overguarding 2025-12-04T08:54:02.2898589Z * [new branch] mlazos/proxy-ctors -> origin/mlazos/proxy-ctors 2025-12-04T08:54:02.2898656Z * [new branch] mlazos/quant-fix -> origin/mlazos/quant-fix 2025-12-04T08:54:02.2898724Z * [new branch] mlazos/resnet-fix -> origin/mlazos/resnet-fix 2025-12-04T08:54:02.2898796Z * [new branch] mlazos/rm-buf-names -> origin/mlazos/rm-buf-names 2025-12-04T08:54:02.2898863Z * [new branch] mlazos/rm-code -> origin/mlazos/rm-code 2025-12-04T08:54:02.2898927Z * [new branch] mlazos/rm-spam -> origin/mlazos/rm-spam 2025-12-04T08:54:02.2898988Z * [new branch] mlazos/rtp -> origin/mlazos/rtp 2025-12-04T08:54:02.2899066Z * [new branch] mlazos/static-idx-dbg -> origin/mlazos/static-idx-dbg 2025-12-04T08:54:02.2899150Z * [new branch] mlazos/static-inputs-log -> origin/mlazos/static-inputs-log 2025-12-04T08:54:02.2899215Z * [new branch] mlazos/stests -> origin/mlazos/stests 2025-12-04T08:54:02.2899285Z * [new branch] mlazos/stream-ops -> origin/mlazos/stream-ops 2025-12-04T08:54:02.2899348Z * [new branch] mlazos/td-fix2 -> origin/mlazos/td-fix2 2025-12-04T08:54:02.2899426Z * [new branch] mlazos/tensor-hasattr2 -> origin/mlazos/tensor-hasattr2 2025-12-04T08:54:02.2899514Z * [new branch] mlazos/test -> origin/mlazos/test 2025-12-04T08:54:02.2899577Z * [new branch] 
mlazos/tf-mode -> origin/mlazos/tf-mode 2025-12-04T08:54:02.2899655Z * [new branch] mlazos/tf-mode-backup2 -> origin/mlazos/tf-mode-backup2 2025-12-04T08:54:02.2899732Z * [new branch] mlazos/tf-mode-reland -> origin/mlazos/tf-mode-reland 2025-12-04T08:54:02.2899807Z * [new branch] mlazos/tf-mode-reland2 -> origin/mlazos/tf-mode-reland2 2025-12-04T08:54:02.2899883Z * [new branch] mlazos/tf-mode-reland3 -> origin/mlazos/tf-mode-reland3 2025-12-04T08:54:02.2899958Z * [new branch] mlazos/triton-no-epi -> origin/mlazos/triton-no-epi 2025-12-04T08:54:02.2900027Z * [new branch] mlazos/tune-proto -> origin/mlazos/tune-proto 2025-12-04T08:54:02.2900099Z * [new branch] mlazos/tuple-fixes -> origin/mlazos/tuple-fixes 2025-12-04T08:54:02.2900173Z * [new branch] mlazos/tuple-fixes2 -> origin/mlazos/tuple-fixes2 2025-12-04T08:54:02.2900249Z * [new branch] mlazos/tuple-handling -> origin/mlazos/tuple-handling 2025-12-04T08:54:02.2900330Z * [new branch] mlazos/user-stream-base -> origin/mlazos/user-stream-base 2025-12-04T08:54:02.2900426Z * [new branch] mlazos/user-streams -> origin/mlazos/user-streams 2025-12-04T08:54:02.2900516Z * [new branch] mlazos/user-streams-backup -> origin/mlazos/user-streams-backup 2025-12-04T08:54:02.2900609Z * [new branch] mlazos/user-streams-backup2 -> origin/mlazos/user-streams-backup2 2025-12-04T08:54:02.2900677Z * [new branch] mlazos/vary-beta -> origin/mlazos/vary-beta 2025-12-04T08:54:02.2900745Z * [new branch] mlazos/vary-beta2 -> origin/mlazos/vary-beta2 2025-12-04T08:54:02.2900818Z * [new branch] mlazos/weird-perf1 -> origin/mlazos/weird-perf1 2025-12-04T08:54:02.2900891Z * [new branch] mm_out_dtype_compile -> origin/mm_out_dtype_compile 2025-12-04T08:54:02.2900955Z * [new branch] module-shim -> origin/module-shim 2025-12-04T08:54:02.2901016Z * [new branch] move_config -> origin/move_config 2025-12-04T08:54:02.2901085Z * [new branch] msaroufim/reduce -> origin/msaroufim/reduce 2025-12-04T08:54:02.2901153Z * [new branch] mtia/basic-cmake -> 
origin/mtia/basic-cmake 2025-12-04T08:54:02.2901252Z * [new branch] mwizak/fix-triton-block-shape -> origin/mwizak/fix-triton-block-shape 2025-12-04T08:54:02.2901318Z * [new branch] my_varlen_backup -> origin/my_varlen_backup 2025-12-04T08:54:02.2901393Z * [new branch] nativert_num_outputs -> origin/nativert_num_outputs 2025-12-04T08:54:02.2901454Z * [new branch] new-codegen -> origin/new-codegen 2025-12-04T08:54:02.2901519Z * [new branch] newtest-base -> origin/newtest-base 2025-12-04T08:54:02.2901592Z * [new branch] ngimel/addmm_dtype -> origin/ngimel/addmm_dtype 2025-12-04T08:54:02.2901655Z * [new branch] ngimel/div_inv -> origin/ngimel/div_inv 2025-12-04T08:54:02.2901733Z * [new branch] ngimel/error_index_list -> origin/ngimel/error_index_list 2025-12-04T08:54:02.2901804Z * [new branch] ngimel/gather_grid -> origin/ngimel/gather_grid 2025-12-04T08:54:02.2901889Z * [new branch] ngimel/gather_grid_release -> origin/ngimel/gather_grid_release 2025-12-04T08:54:02.2901952Z * [new branch] ngimel/gg_new -> origin/ngimel/gg_new 2025-12-04T08:54:02.2902019Z * [new branch] ngimel/hostalloc -> origin/ngimel/hostalloc 2025-12-04T08:54:02.2902086Z * [new branch] ngimel/storage_id -> origin/ngimel/storage_id 2025-12-04T08:54:02.2902171Z * [new branch] nightly -> origin/nightly 2025-12-04T08:54:02.2902287Z * [new branch] nikitaved/addmm_1_rowcol_lt_path_check -> origin/nikitaved/addmm_1_rowcol_lt_path_check 2025-12-04T08:54:02.2902409Z * [new branch] nikitaved/addmm_epilogue_fusions_2d_bias -> origin/nikitaved/addmm_epilogue_fusions_2d_bias 2025-12-04T08:54:02.2902535Z * [new branch] nikitaved/addmm_epilogue_fusions_inductor -> origin/nikitaved/addmm_epilogue_fusions_inductor 2025-12-04T08:54:02.2902657Z * [new branch] nikitaved/addmm_epilogue_fusions_scratch -> origin/nikitaved/addmm_epilogue_fusions_scratch 2025-12-04T08:54:02.2902770Z * [new branch] nikitaved/grad_addmm_epilogue_fusions -> origin/nikitaved/grad_addmm_epilogue_fusions 2025-12-04T08:54:02.2902882Z * [new 
branch] nikitaved/simpler_can_use_32bit_index -> origin/nikitaved/simpler_can_use_32bit_index 2025-12-04T08:54:02.2902949Z * [new branch] nikitaved/test -> origin/nikitaved/test 2025-12-04T08:54:02.2903071Z * [new branch] nmacchioni-perf-test-async-autotune -> origin/nmacchioni-perf-test-async-autotune 2025-12-04T08:54:02.2903149Z * [new branch] no_distributed_log_spew -> origin/no_distributed_log_spew 2025-12-04T08:54:02.2903248Z * [new branch] nofun-hack -> origin/nofun-hack 2025-12-04T08:54:02.2903310Z * [new branch] norm_bench -> origin/norm_bench 2025-12-04T08:54:02.2903385Z * [new branch] nullplay/fuse_matmul -> origin/nullplay/fuse_matmul 2025-12-04T08:54:02.2903457Z * [new branch] nullplay_fuse_matmul -> origin/nullplay_fuse_matmul 2025-12-04T08:54:02.2903522Z * [new branch] optimizer_test -> origin/optimizer_test 2025-12-04T08:54:02.2903590Z * [new branch] orig/release/1.10 -> origin/orig/release/1.10 2025-12-04T08:54:02.2903659Z * [new branch] orig/release/1.11 -> origin/orig/release/1.11 2025-12-04T08:54:02.2903725Z * [new branch] orig/release/1.12 -> origin/orig/release/1.12 2025-12-04T08:54:02.2903792Z * [new branch] orig/release/1.13 -> origin/orig/release/1.13 2025-12-04T08:54:02.2903857Z * [new branch] orig/release/1.6 -> origin/orig/release/1.6 2025-12-04T08:54:02.2903925Z * [new branch] orig/release/1.7 -> origin/orig/release/1.7 2025-12-04T08:54:02.2903991Z * [new branch] orig/release/1.8 -> origin/orig/release/1.8 2025-12-04T08:54:02.2904056Z * [new branch] orig/release/1.9 -> origin/orig/release/1.9 2025-12-04T08:54:02.2904121Z * [new branch] orig/release/2.0 -> origin/orig/release/2.0 2025-12-04T08:54:02.2904184Z * [new branch] orig/release/2.1 -> origin/orig/release/2.1 2025-12-04T08:54:02.2904249Z * [new branch] orig/release/2.2 -> origin/orig/release/2.2 2025-12-04T08:54:02.2904314Z * [new branch] orig/release/2.3 -> origin/orig/release/2.3 2025-12-04T08:54:02.2904377Z * [new branch] orig/release/2.4 -> origin/orig/release/2.4 
2025-12-04T08:54:02.2904441Z * [new branch] orig/release/2.5 -> origin/orig/release/2.5 2025-12-04T08:54:02.2904507Z * [new branch] orig/release/2.6 -> origin/orig/release/2.6 2025-12-04T08:54:02.2904570Z * [new branch] orig/release/2.7 -> origin/orig/release/2.7 2025-12-04T08:54:02.2904634Z * [new branch] orig/release/2.8 -> origin/orig/release/2.8 2025-12-04T08:54:02.2904699Z * [new branch] orig/release/2.9 -> origin/orig/release/2.9 2025-12-04T08:54:02.2904783Z * [new branch] origin/gh/fxdawnn/1/base -> origin/origin/gh/fxdawnn/1/base 2025-12-04T08:54:02.2904896Z * [new branch] origin/gh/fxdawnn/1/orig -> origin/origin/gh/fxdawnn/1/orig 2025-12-04T08:54:02.2904978Z * [new branch] origin/gh/zpcore/14/orig -> origin/origin/gh/zpcore/14/orig 2025-12-04T08:54:02.2905044Z * [new branch] oulgen-patch-1 -> origin/oulgen-patch-1 2025-12-04T08:54:02.2905111Z * [new branch] oulgen-patch-2 -> origin/oulgen-patch-2 2025-12-04T08:54:02.2905179Z * [new branch] oulgen-patch-3 -> origin/oulgen-patch-3 2025-12-04T08:54:02.2905244Z * [new branch] oulgen-patch-4 -> origin/oulgen-patch-4 2025-12-04T08:54:02.2905310Z * [new branch] padded-tensor -> origin/padded-tensor 2025-12-04T08:54:02.2905374Z * [new branch] pca2 -> origin/pca2 2025-12-04T08:54:02.2905444Z * [new branch] per_channel_backup -> origin/per_channel_backup 2025-12-04T08:54:02.2905506Z * [new branch] perf_ops -> origin/perf_ops 2025-12-04T08:54:02.2905575Z * [new branch] perf_ops_2_9 -> origin/perf_ops_2_9 2025-12-04T08:54:02.2905645Z * [new branch] pianpwk-patch-1 -> origin/pianpwk-patch-1 2025-12-04T08:54:02.2905731Z * [new branch] pianpwk/__draft_debug_mode -> origin/pianpwk/__draft_debug_mode 2025-12-04T08:54:02.2905866Z * [new branch] pianpwk/_debug_mode_for_triton_draft -> origin/pianpwk/_debug_mode_for_triton_draft 2025-12-04T08:54:02.2906012Z * [new branch] pianpwk/_debug_nn_module_compile -> origin/pianpwk/_debug_nn_module_compile 2025-12-04T08:54:02.2906099Z * [new branch] pianpwk/_draft_triton_11_3 -> 
origin/pianpwk/_draft_triton_11_3 2025-12-04T08:54:02.2906190Z * [new branch] pianpwk/_manual_bucket_draft -> origin/pianpwk/_manual_bucket_draft 2025-12-04T08:54:02.2906291Z * [new branch] pianpwk/_profile_w_dispatch_keys -> origin/pianpwk/_profile_w_dispatch_keys 2025-12-04T08:54:02.2906390Z * [new branch] pianpwk/_super_draft_debug_mode -> origin/pianpwk/_super_draft_debug_mode 2025-12-04T08:54:02.2906495Z * [new branch] pianpwk/_unbacked_local_shard_size -> origin/pianpwk/_unbacked_local_shard_size 2025-12-04T08:54:02.2906569Z * [new branch] pianpwk/anomaly_tb -> origin/pianpwk/anomaly_tb 2025-12-04T08:54:02.2906653Z * [new branch] pianpwk/auto_fx_annotate -> origin/pianpwk/auto_fx_annotate 2025-12-04T08:54:02.2906763Z * [new branch] pianpwk/backed_size_oblivious_export -> origin/pianpwk/backed_size_oblivious_export 2025-12-04T08:54:02.2906849Z * [new branch] pianpwk/bert_dynamic_perf -> origin/pianpwk/bert_dynamic_perf 2025-12-04T08:54:02.2906945Z * [new branch] pianpwk/debug_fwd_stack_traces -> origin/pianpwk/debug_fwd_stack_traces 2025-12-04T08:54:02.2907030Z * [new branch] pianpwk/debug_hash_tensor -> origin/pianpwk/debug_hash_tensor 2025-12-04T08:54:02.2907120Z * [new branch] pianpwk/debug_mode_annotate -> origin/pianpwk/debug_mode_annotate 2025-12-04T08:54:02.2907208Z * [new branch] pianpwk/debug_mode_defaults -> origin/pianpwk/debug_mode_defaults 2025-12-04T08:54:02.2907288Z * [new branch] pianpwk/debug_mode_hacks -> origin/pianpwk/debug_mode_hacks 2025-12-04T08:54:02.2907396Z * [new branch] pianpwk/debug_mode_opcall_refactor -> origin/pianpwk/debug_mode_opcall_refactor 2025-12-04T08:54:02.2907482Z * [new branch] pianpwk/debug_mode_show_ids -> origin/pianpwk/debug_mode_show_ids 2025-12-04T08:54:02.2907564Z * [new branch] pianpwk/debug_mode_triton -> origin/pianpwk/debug_mode_triton 2025-12-04T08:54:02.2907659Z * [new branch] pianpwk/debug_show_stack_trace -> origin/pianpwk/debug_show_stack_trace 2025-12-04T08:54:02.2907759Z * [new branch] 
pianpwk/debug_wait_on_collective -> origin/pianpwk/debug_wait_on_collective 2025-12-04T08:54:02.2907893Z * [new branch] pianpwk/debugmode_compile_tf -> origin/pianpwk/debugmode_compile_tf 2025-12-04T08:54:02.2908016Z * [new branch] pianpwk/dispatch_key_debugging_for_debug -> origin/pianpwk/dispatch_key_debugging_for_debug 2025-12-04T08:54:02.2908122Z * [new branch] pianpwk/draft_debug_mode_tfcompile -> origin/pianpwk/draft_debug_mode_tfcompile 2025-12-04T08:54:02.2908215Z * [new branch] pianpwk/draft_multikernel_nn -> origin/pianpwk/draft_multikernel_nn 2025-12-04T08:54:02.2908331Z * [new branch] pianpwk/draft_multikernel_status_10_5 -> origin/pianpwk/draft_multikernel_status_10_5 2025-12-04T08:54:02.2908421Z * [new branch] pianpwk/dtensor_custom_chunk -> origin/pianpwk/dtensor_custom_chunk 2025-12-04T08:54:02.2908523Z * [new branch] pianpwk/dtensor_unbacked_keypath -> origin/pianpwk/dtensor_unbacked_keypath 2025-12-04T08:54:02.2908603Z * [new branch] pianpwk/event_list_tree -> origin/pianpwk/event_list_tree 2025-12-04T08:54:02.2908683Z * [new branch] pianpwk/false_numel_refs -> origin/pianpwk/false_numel_refs 2025-12-04T08:54:02.2908761Z * [new branch] pianpwk/maybe_guard_rel -> origin/pianpwk/maybe_guard_rel 2025-12-04T08:54:02.2908900Z * [new branch] pianpwk/multikernel_hints_draft -> origin/pianpwk/multikernel_hints_draft 2025-12-04T08:54:02.2909007Z * [new branch] pianpwk/no_size_oblivious_slice_scat -> origin/pianpwk/no_size_oblivious_slice_scat 2025-12-04T08:54:02.2909123Z * [new branch] pianpwk/oblivious_reshape_view_better -> origin/pianpwk/oblivious_reshape_view_better 2025-12-04T08:54:02.2909204Z * [new branch] pianpwk/pre_forward_hook -> origin/pianpwk/pre_forward_hook 2025-12-04T08:54:02.2909309Z * [new branch] pianpwk/skip_python_keys_alternate -> origin/pianpwk/skip_python_keys_alternate 2025-12-04T08:54:02.2909415Z * [new branch] pianpwk/skip_python_keys_in_guards -> origin/pianpwk/skip_python_keys_in_guards 2025-12-04T08:54:02.2909495Z * [new 
branch] pianpwk/sym_tokens_draft -> origin/pianpwk/sym_tokens_draft 2025-12-04T08:54:02.2909571Z * [new branch] pianpwk/symint_one_hot -> origin/pianpwk/symint_one_hot 2025-12-04T08:54:02.2909686Z * [new branch] pianpwk/test_pointwise_guard_or_false -> origin/pianpwk/test_pointwise_guard_or_false 2025-12-04T08:54:02.2909782Z * [new branch] pianpwk/totally_draft_sym_wrap -> origin/pianpwk/totally_draft_sym_wrap 2025-12-04T08:54:02.2909862Z * [new branch] pianpwk/try_dumb_stuff -> origin/pianpwk/try_dumb_stuff 2025-12-04T08:54:02.2909941Z * [new branch] pianpwk/try_dumb_stuff_2 -> origin/pianpwk/try_dumb_stuff_2 2025-12-04T08:54:02.2910032Z * [new branch] pianpwk/unbacked_dtensor_mm -> origin/pianpwk/unbacked_dtensor_mm 2025-12-04T08:54:02.2910129Z * [new branch] pianpwk/unbacked_tracing_12_2 -> origin/pianpwk/unbacked_tracing_12_2 2025-12-04T08:54:02.2910203Z * [new branch] pianpwk/user_symints -> origin/pianpwk/user_symints 2025-12-04T08:54:02.2910280Z * [new branch] pianpwk/wan21_reshape -> origin/pianpwk/wan21_reshape 2025-12-04T08:54:02.2910377Z * [new branch] piz/fix_partial_backward_1112 -> origin/piz/fix_partial_backward_1112 2025-12-04T08:54:02.2910452Z * [new branch] piz/prop_cache_clean -> origin/piz/prop_cache_clean 2025-12-04T08:54:02.2910520Z * [new branch] pool-separate -> origin/pool-separate 2025-12-04T08:54:02.2910583Z * [new branch] pr-156087 -> origin/pr-156087 2025-12-04T08:54:02.2910642Z * [new branch] pr/131860 -> origin/pr/131860 2025-12-04T08:54:02.2910710Z * [new branch] predispatch_to -> origin/predispatch_to 2025-12-04T08:54:02.2910803Z * [new branch] protect-c17 -> origin/protect-c17 2025-12-04T08:54:02.2910868Z * [new branch] pt-opt-cuda3 -> origin/pt-opt-cuda3 2025-12-04T08:54:02.2910947Z * [new branch] python_compiled_autograd -> origin/python_compiled_autograd 2025-12-04T08:54:02.2911077Z * [new branch] q1l1/fix_device_moved_constant_type_unknown -> origin/q1l1/fix_device_moved_constant_type_unknown 2025-12-04T08:54:02.2911216Z * [new 
branch] q1l1/fix_wrong_default_type_for_kernel_call_args -> origin/q1l1/fix_wrong_default_type_for_kernel_call_args 2025-12-04T08:54:02.2911294Z * [new branch] qchip/export-D54134695 -> origin/qchip/export-D54134695 2025-12-04T08:54:02.2911368Z * [new branch] quote-pytest_cache -> origin/quote-pytest_cache 2025-12-04T08:54:02.2911463Z * [new branch] reland-accgrad-stream-warn -> origin/reland-accgrad-stream-warn 2025-12-04T08:54:02.2911530Z * [new branch] release/1.10 -> origin/release/1.10 2025-12-04T08:54:02.2911593Z * [new branch] release/1.11 -> origin/release/1.11 2025-12-04T08:54:02.2911655Z * [new branch] release/1.12 -> origin/release/1.12 2025-12-04T08:54:02.2911716Z * [new branch] release/1.13 -> origin/release/1.13 2025-12-04T08:54:02.2911800Z * [new branch] release/1.4 -> origin/release/1.4 2025-12-04T08:54:02.2911864Z * [new branch] release/1.4.1 -> origin/release/1.4.1 2025-12-04T08:54:02.2911925Z * [new branch] release/1.5 -> origin/release/1.5 2025-12-04T08:54:02.2911986Z * [new branch] release/1.6 -> origin/release/1.6 2025-12-04T08:54:02.2912045Z * [new branch] release/1.7 -> origin/release/1.7 2025-12-04T08:54:02.2912106Z * [new branch] release/1.8 -> origin/release/1.8 2025-12-04T08:54:02.2912166Z * [new branch] release/1.9 -> origin/release/1.9 2025-12-04T08:54:02.2912225Z * [new branch] release/2.0 -> origin/release/2.0 2025-12-04T08:54:02.2912285Z * [new branch] release/2.1 -> origin/release/2.1 2025-12-04T08:54:02.2912343Z * [new branch] release/2.2 -> origin/release/2.2 2025-12-04T08:54:02.2912404Z * [new branch] release/2.3 -> origin/release/2.3 2025-12-04T08:54:02.2912464Z * [new branch] release/2.4 -> origin/release/2.4 2025-12-04T08:54:02.2912557Z * [new branch] release/2.5 -> origin/release/2.5 2025-12-04T08:54:02.2912616Z * [new branch] release/2.6 -> origin/release/2.6 2025-12-04T08:54:02.2912674Z * [new branch] release/2.7 -> origin/release/2.7 2025-12-04T08:54:02.2912735Z * [new branch] release/2.8 -> origin/release/2.8 
2025-12-04T08:54:02.2912796Z * [new branch] release/2.9 -> origin/release/2.9 2025-12-04T08:54:02.2912859Z * [new branch] release_notes -> origin/release_notes 2025-12-04T08:54:02.2912935Z * [new branch] remove_pyinterpreter -> origin/remove_pyinterpreter 2025-12-04T08:54:02.2913057Z * [new branch] replace-pytorch-labs-20250812-195836 -> origin/replace-pytorch-labs-20250812-195836 2025-12-04T08:54:02.2913177Z * [new branch] replace-pytorch-labs-20250812-200248 -> origin/replace-pytorch-labs-20250812-200248 2025-12-04T08:54:02.2913295Z * [new branch] replace-pytorch-labs-20250812-200324 -> origin/replace-pytorch-labs-20250812-200324 2025-12-04T08:54:02.2913410Z * [new branch] replace-pytorch-labs-20250812-204020 -> origin/replace-pytorch-labs-20250812-204020 2025-12-04T08:54:02.2913539Z * [new branch] revert-131069-gh/krzysztofjordan/1/head -> origin/revert-131069-gh/krzysztofjordan/1/head 2025-12-04T08:54:02.2913684Z * [new branch] revert-131469-gh/andrewor14/51/head -> origin/revert-131469-gh/andrewor14/51/head 2025-12-04T08:54:02.2913786Z * [new branch] revert-152361-gh/fadara01/1/head -> origin/revert-152361-gh/fadara01/1/head 2025-12-04T08:54:02.2913889Z * [new branch] revert-156870-gh/skarjala/3/head -> origin/revert-156870-gh/skarjala/3/head 2025-12-04T08:54:02.2914058Z * [new branch] revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ -> origin/revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ 2025-12-04T08:54:02.2914153Z * [new branch] revert-hoo-invoke-subgraph -> origin/revert-hoo-invoke-subgraph 2025-12-04T08:54:02.2914251Z * [new branch] revert_always_build_distributed -> origin/revert_always_build_distributed 2025-12-04T08:54:02.2914318Z * [new branch] rms_norm_patch -> origin/rms_norm_patch 2025-12-04T08:54:02.2914412Z * [new branch] ruisi/fix_all_to_all_estimation -> origin/ruisi/fix_all_to_all_estimation 2025-12-04T08:54:02.2914495Z * [new branch] ruisi/fix_comm_estimation -> origin/ruisi/fix_comm_estimation 2025-12-04T08:54:02.2914598Z * [new 
branch] ruisi/fix_dynamic_shape_estimation -> origin/ruisi/fix_dynamic_shape_estimation 2025-12-04T08:54:02.2914718Z * [new branch] ruisi/fix_llama3_autobucketing -> origin/ruisi/fix_llama3_autobucketing 2025-12-04T08:54:02.2914824Z * [new branch] ruisi/fix_manual_bucketing_ep_pass -> origin/ruisi/fix_manual_bucketing_ep_pass 2025-12-04T08:54:02.2914906Z * [new branch] ruisi/manual_bucket_pass -> origin/ruisi/manual_bucket_pass 2025-12-04T08:54:02.2915050Z * [new branch] ryanguo99/cleanup-dynamo-expected-failures -> origin/ryanguo99/cleanup-dynamo-expected-failures 2025-12-04T08:54:02.2915139Z * [new branch] ryanguo99/fix-closure-var -> origin/ryanguo99/fix-closure-var 2025-12-04T08:54:02.2915217Z * [new branch] rzou/faketensor_bench -> origin/rzou/faketensor_bench 2025-12-04T08:54:02.2915280Z * [new branch] rzou/njt -> origin/rzou/njt 2025-12-04T08:54:02.2915340Z * [new branch] rzou/pca -> origin/rzou/pca 2025-12-04T08:54:02.2915407Z * [new branch] rzou/realprop -> origin/rzou/realprop 2025-12-04T08:54:02.2915471Z * [new branch] samplevllm -> origin/samplevllm 2025-12-04T08:54:02.2915635Z * [new branch] sanchitintel/weird_thing_with_test_cpu_select_algorithm -> origin/sanchitintel/weird_thing_with_test_cpu_select_algorithm 2025-12-04T08:54:02.2915725Z * [new branch] sapling-pr-archive-SS-JIA -> origin/sapling-pr-archive-SS-JIA 2025-12-04T08:54:02.2915838Z * [new branch] sapling-pr-archive-tushar00jain -> origin/sapling-pr-archive-tushar00jain 2025-12-04T08:54:02.2915901Z * [new branch] save -> origin/save 2025-12-04T08:54:02.2915998Z * [new branch] scaled_mm -> origin/scaled_mm 2025-12-04T08:54:02.2916063Z * [new branch] scan_attempt -> origin/scan_attempt 2025-12-04T08:54:02.2916123Z * [new branch] sdym/2.5.1 -> origin/sdym/2.5.1 2025-12-04T08:54:02.2916229Z * [new branch] sekyondaMeta-dynamoconfig-fix -> origin/sekyondaMeta-dynamoconfig-fix 2025-12-04T08:54:02.2916305Z * [new branch] shengf/fx-xform-perf -> origin/shengf/fx-xform-perf 
2025-12-04T08:54:02.2916380Z * [new branch] shoumikhin-patch-1 -> origin/shoumikhin-patch-1 2025-12-04T08:54:02.2916453Z * [new branch] solve-accuracy-fix -> origin/solve-accuracy-fix 2025-12-04T08:54:02.2916533Z * [new branch] some_rocm_inductor_skips -> origin/some_rocm_inductor_skips 2025-12-04T08:54:02.2916657Z * [new branch] soulitzer/stash-tls-ac -> origin/soulitzer/stash-tls-ac 2025-12-04T08:54:02.2916738Z * [new branch] sparse-mm-bf16-support -> origin/sparse-mm-bf16-support 2025-12-04T08:54:02.2916811Z * [new branch] starterTaskUpdate -> origin/starterTaskUpdate 2025-12-04T08:54:02.2916869Z * [new branch] suo -> origin/suo 2025-12-04T08:54:02.2916932Z * [new branch] sve-poc -> origin/sve-poc 2025-12-04T08:54:02.2916995Z * [new branch] switch-bn -> origin/switch-bn 2025-12-04T08:54:02.2917086Z * [new branch] sy_annotation_in_autograd_hop -> origin/sy_annotation_in_autograd_hop 2025-12-04T08:54:02.2917156Z * [new branch] sy_aot_eager_record -> origin/sy_aot_eager_record 2025-12-04T08:54:02.2917223Z * [new branch] sy_custom_bucketing -> origin/sy_custom_bucketing 2025-12-04T08:54:02.2917291Z * [new branch] sy_debug_mode_test -> origin/sy_debug_mode_test 2025-12-04T08:54:02.2917356Z * [new branch] sy_deserialize -> origin/sy_deserialize 2025-12-04T08:54:02.2917421Z * [new branch] sy_dump_gm_code -> origin/sy_dump_gm_code 2025-12-04T08:54:02.2917482Z * [new branch] sy_exp -> origin/sy_exp 2025-12-04T08:54:02.2917601Z * [new branch] sy_export_annotation -> origin/sy_export_annotation 2025-12-04T08:54:02.2917669Z * [new branch] sy_invoke_subgraph -> origin/sy_invoke_subgraph 2025-12-04T08:54:02.2917735Z * [new branch] sy_kernel_bw_name -> origin/sy_kernel_bw_name 2025-12-04T08:54:02.2919369Z * [new branch] sy_multi_arch -> origin/sy_multi_arch 2025-12-04T08:54:02.2919441Z * [new branch] sy_nn_module_stack -> origin/sy_nn_module_stack 2025-12-04T08:54:02.2919511Z * [new branch] sy_original_dtensor -> origin/sy_original_dtensor 2025-12-04T08:54:02.2919584Z * [new 
branch] sy_profiler_cia -> origin/sy_profiler_cia 2025-12-04T08:54:02.2919646Z * [new branch] symm_mem_sync -> origin/symm_mem_sync 2025-12-04T08:54:02.2919730Z * [new branch] sympy-bottleneck-repro -> origin/sympy-bottleneck-repro 2025-12-04T08:54:02.2919811Z * [new branch] tensordict_integration -> origin/tensordict_integration 2025-12-04T08:54:02.2919889Z * [new branch] test-move-conda-builds -> origin/test-move-conda-builds 2025-12-04T08:54:02.2919951Z * [new branch] test-old -> origin/test-old 2025-12-04T08:54:02.2920015Z * [new branch] test/bmm_heur -> origin/test/bmm_heur 2025-12-04T08:54:02.2920111Z * [new branch] tianren/customOp_autotune_fix -> origin/tianren/customOp_autotune_fix 2025-12-04T08:54:02.2920222Z * [new branch] tianren/customOp_enable_max_autotune -> origin/tianren/customOp_enable_max_autotune 2025-12-04T08:54:02.2920305Z * [new branch] tianren/customOp_fusion -> origin/tianren/customOp_fusion 2025-12-04T08:54:02.2920430Z * [new branch] tianren/customop_collectiveop_benchmark -> origin/tianren/customop_collectiveop_benchmark 2025-12-04T08:54:02.2920566Z * [new branch] tianren/customop_collectiveop_benchmark_fix -> origin/tianren/customop_collectiveop_benchmark_fix 2025-12-04T08:54:02.2920666Z * [new branch] tianren/customop_dynamic_config -> origin/tianren/customop_dynamic_config 2025-12-04T08:54:02.2920758Z * [new branch] tianren/dynamic_range_input -> origin/tianren/dynamic_range_input 2025-12-04T08:54:02.2920856Z * [new branch] tianren/dynamic_range_input_fix -> origin/tianren/dynamic_range_input_fix 2025-12-04T08:54:02.2920959Z * [new branch] tianren/dynamic_range_input_merge -> origin/tianren/dynamic_range_input_merge 2025-12-04T08:54:02.2921099Z * [new branch] tianren/flex_paged_attn_fix_temp -> origin/tianren/flex_paged_attn_fix_temp 2025-12-04T08:54:02.2921176Z * [new branch] tianren/fx_codegen_dump -> origin/tianren/fx_codegen_dump 2025-12-04T08:54:02.2921258Z * [new branch] tianren/symmetric_memory -> origin/tianren/symmetric_memory 
2025-12-04T08:54:02.2921323Z * [new branch] tianren/test -> origin/tianren/test 2025-12-04T08:54:02.2921400Z * [new branch] tidy_performance_cyy -> origin/tidy_performance_cyy 2025-12-04T08:54:02.2921458Z * [new branch] tmp -> origin/tmp 2025-12-04T08:54:02.2921523Z * [new branch] torchtitan_ep -> origin/torchtitan_ep 2025-12-04T08:54:02.2921601Z * [new branch] torchtitan_integration -> origin/torchtitan_integration 2025-12-04T08:54:02.2921684Z * [new branch] trace_fsdp_torchtune_lora -> origin/trace_fsdp_torchtune_lora 2025-12-04T08:54:02.2921772Z * [new branch] traceable_fsdp_unit_tests -> origin/traceable_fsdp_unit_tests 2025-12-04T08:54:02.2921842Z * [new branch] tree_loop_vec_base -> origin/tree_loop_vec_base 2025-12-04T08:54:02.2921906Z * [new branch] triton_kernel -> origin/triton_kernel 2025-12-04T08:54:02.2921993Z * [new branch] tt_pkg_1908 -> origin/tt_pkg_1908 2025-12-04T08:54:02.2922057Z * [new branch] type_dec -> origin/type_dec 2025-12-04T08:54:02.2922148Z * [new branch] udate-sphinx-dependancies -> origin/udate-sphinx-dependancies 2025-12-04T08:54:02.2922285Z * [new branch] update-audio-commit-hash/17630256502-1803-1 -> origin/update-audio-commit-hash/17630256502-1803-1 2025-12-04T08:54:02.2922416Z * [new branch] update-audio-commit-hash/19087141161-1916-1 -> origin/update-audio-commit-hash/19087141161-1916-1 2025-12-04T08:54:02.2922546Z * [new branch] update-audio-commit-hash/19250643381-1929-1 -> origin/update-audio-commit-hash/19250643381-1929-1 2025-12-04T08:54:02.2922675Z * [new branch] update-audio-commit-hash/19397724337-1935-1 -> origin/update-audio-commit-hash/19397724337-1935-1 2025-12-04T08:54:02.2922804Z * [new branch] update-audio-commit-hash/19555670148-1941-1 -> origin/update-audio-commit-hash/19555670148-1941-1 2025-12-04T08:54:02.2922930Z * [new branch] update-audio-commit-hash/19750627930-1946-1 -> origin/update-audio-commit-hash/19750627930-1946-1 2025-12-04T08:54:02.2923064Z * [new branch] 
update-triton-commit-hash/13663274526-1487-2 -> origin/update-triton-commit-hash/13663274526-1487-2 2025-12-04T08:54:02.2923195Z * [new branch] update-vision-commit-hash/19087141161-1916-1 -> origin/update-vision-commit-hash/19087141161-1916-1 2025-12-04T08:54:02.2923325Z * [new branch] update-vision-commit-hash/19184897099-1925-1 -> origin/update-vision-commit-hash/19184897099-1925-1 2025-12-04T08:54:02.2923458Z * [new branch] update-vision-commit-hash/19250643381-1929-1 -> origin/update-vision-commit-hash/19250643381-1929-1 2025-12-04T08:54:02.2923588Z * [new branch] update-vision-commit-hash/19381328640-1934-1 -> origin/update-vision-commit-hash/19381328640-1934-1 2025-12-04T08:54:02.2923720Z * [new branch] update-vision-commit-hash/19485237164-1938-1 -> origin/update-vision-commit-hash/19485237164-1938-1 2025-12-04T08:54:02.2923847Z * [new branch] update-vllm-commit-hash/18451675449-1879-1 -> origin/update-vllm-commit-hash/18451675449-1879-1 2025-12-04T08:54:02.2923930Z * [new branch] update-vllm-dockerfile -> origin/update-vllm-dockerfile 2025-12-04T08:54:02.2924053Z * [new branch] update-xla-commit-hash/19224287370-211-1 -> origin/update-xla-commit-hash/19224287370-211-1 2025-12-04T08:54:02.2924199Z * [new branch] update-xla-commit-hash/19422028566-212-1 -> origin/update-xla-commit-hash/19422028566-212-1 2025-12-04T08:54:02.2924320Z * [new branch] update-xla-commit-hash/19626841311-213-1 -> origin/update-xla-commit-hash/19626841311-213-1 2025-12-04T08:54:02.2924444Z * [new branch] update_docs_torch_multinomial_issue#125388 -> origin/update_docs_torch_multinomial_issue#125388 2025-12-04T08:54:02.2924523Z * [new branch] update_operator_readme -> origin/update_operator_readme 2025-12-04T08:54:02.2924611Z * [new branch] update_slow_tests_1722488736 -> origin/update_slow_tests_1722488736 2025-12-04T08:54:02.2924700Z * [new branch] update_slow_tests_1722879173 -> origin/update_slow_tests_1722879173 2025-12-04T08:54:02.2924785Z * [new branch] 
update_slow_tests_1762155677 -> origin/update_slow_tests_1762155677 2025-12-04T08:54:02.2924870Z * [new branch] update_slow_tests_1763365283 -> origin/update_slow_tests_1763365283 2025-12-04T08:54:02.2924957Z * [new branch] update_submodule_FBGEMM -> origin/update_submodule_FBGEMM 2025-12-04T08:54:02.2925034Z * [new branch] update_submodule_kineto -> origin/update_submodule_kineto 2025-12-04T08:54:02.2925124Z * [new branch] update_submodule_tensorpipe -> origin/update_submodule_tensorpipe 2025-12-04T08:54:02.2925251Z * [new branch] upload-tests-for-autorevert -> origin/upload-tests-for-autorevert 2025-12-04T08:54:02.2925312Z * [new branch] v0.1.2 -> origin/v0.1.2 2025-12-04T08:54:02.2925373Z * [new branch] v1.0.1 -> origin/v1.0.1 2025-12-04T08:54:02.2925430Z * [new branch] v1.0.3 -> origin/v1.0.3 2025-12-04T08:54:02.2925486Z * [new branch] v1.1.0 -> origin/v1.1.0 2025-12-04T08:54:02.2925542Z * [new branch] v1.2.0 -> origin/v1.2.0 2025-12-04T08:54:02.2925600Z * [new branch] v1.3.0 -> origin/v1.3.0 2025-12-04T08:54:02.2925655Z * [new branch] v1.3.1 -> origin/v1.3.1 2025-12-04T08:54:02.2925720Z * [new branch] validate_fn -> origin/validate_fn 2025-12-04T08:54:02.2925786Z * [new branch] validations_2.6 -> origin/validations_2.6 2025-12-04T08:54:02.2925853Z * [new branch] validations_2.8 -> origin/validations_2.8 2025-12-04T08:54:02.2925918Z * [new branch] varlen-api -> origin/varlen-api 2025-12-04T08:54:02.2926027Z * [new branch] varlen-api-backup -> origin/varlen-api-backup 2025-12-04T08:54:02.2926105Z * [new branch] varlen_batch_invariance -> origin/varlen_batch_invariance 2025-12-04T08:54:02.2926171Z * [new branch] viable/strict -> origin/viable/strict 2025-12-04T08:54:02.2926288Z * [new branch] vishal9-team/dtensor_parallelism_toy -> origin/vishal9-team/dtensor_parallelism_toy 2025-12-04T08:54:02.2926350Z * [new branch] vllmbuildci -> origin/vllmbuildci 2025-12-04T08:54:02.2926411Z * [new branch] vllmpin -> origin/vllmpin 2025-12-04T08:54:02.2926498Z * [new branch] 
vscode-recommend-pyrefly -> origin/vscode-recommend-pyrefly 2025-12-04T08:54:02.2926565Z * [new branch] wdvr-patch-1 -> origin/wdvr-patch-1 2025-12-04T08:54:02.2926630Z * [new branch] wdvr/iss_145259 -> origin/wdvr/iss_145259 2025-12-04T08:54:02.2926690Z * [new branch] whc/pei -> origin/whc/pei 2025-12-04T08:54:02.2926753Z * [new branch] whc/pp_fix -> origin/whc/pp_fix 2025-12-04T08:54:02.2926816Z * [new branch] whc/sharding -> origin/whc/sharding 2025-12-04T08:54:02.2926880Z * [new branch] whc/sharding2 -> origin/whc/sharding2 2025-12-04T08:54:02.2926991Z * [new branch] whc/uneven -> origin/whc/uneven 2025-12-04T08:54:02.2927061Z * [new branch] whc/uneven-merge -> origin/whc/uneven-merge 2025-12-04T08:54:02.2927122Z * [new branch] win_warnings -> origin/win_warnings 2025-12-04T08:54:02.2927198Z * [new branch] windows_libtorch_free -> origin/windows_libtorch_free 2025-12-04T08:54:02.2927261Z * [new branch] xmfan-war -> origin/xmfan-war 2025-12-04T08:54:02.2927324Z * [new branch] xmfan/ca_0516 -> origin/xmfan/ca_0516 2025-12-04T08:54:02.2927392Z * [new branch] xmfan/ca_1051b93192 -> origin/xmfan/ca_1051b93192 2025-12-04T08:54:02.2927540Z * [new branch] xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 -> origin/xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 2025-12-04T08:54:02.2927609Z * [new branch] xmfan/ca_5a2be192d1 -> origin/xmfan/ca_5a2be192d1 2025-12-04T08:54:02.2927678Z * [new branch] xmfan/ca_9d59b516e9 -> origin/xmfan/ca_9d59b516e9 2025-12-04T08:54:02.2927743Z * [new branch] xmfan/ca_apr8 -> origin/xmfan/ca_apr8 2025-12-04T08:54:02.2927804Z * [new branch] xmfan/ca_base -> origin/xmfan/ca_base 2025-12-04T08:54:02.2927913Z * [new branch] xmfan/ca_dynamic -> origin/xmfan/ca_dynamic 2025-12-04T08:54:02.2927980Z * [new branch] xmfan/ca_fix_dyn -> origin/xmfan/ca_fix_dyn 2025-12-04T08:54:02.2928052Z * [new branch] xmfan/ca_fix_lowering -> origin/xmfan/ca_fix_lowering 2025-12-04T08:54:02.2928127Z * [new branch] xmfan/ca_fix_polyfills -> 
origin/xmfan/ca_fix_polyfills 2025-12-04T08:54:02.2928188Z * [new branch] xmfan/ca_jan3 -> origin/xmfan/ca_jan3 2025-12-04T08:54:02.2928251Z * [new branch] xmfan/ca_jun18 -> origin/xmfan/ca_jun18 2025-12-04T08:54:02.2928317Z * [new branch] xmfan/ca_jun24 -> origin/xmfan/ca_jun24 2025-12-04T08:54:02.2928382Z * [new branch] xmfan/ca_nested -> origin/xmfan/ca_nested 2025-12-04T08:54:02.2928448Z * [new branch] xmfan/ca_overhead -> origin/xmfan/ca_overhead 2025-12-04T08:54:02.2928541Z * [new branch] xmfan/ca_overhead_0eba7e5451 -> origin/xmfan/ca_overhead_0eba7e5451 2025-12-04T08:54:02.2928608Z * [new branch] xmfan/cacu_jun18 -> origin/xmfan/cacu_jun18 2025-12-04T08:54:02.2928673Z * [new branch] xmfan/cacu_jun19 -> origin/xmfan/cacu_jun19 2025-12-04T08:54:02.2928739Z * [new branch] xmfan/cacu_jun4 -> origin/xmfan/cacu_jun4 2025-12-04T08:54:02.2928819Z * [new branch] xmfan/disable_duck_shape -> origin/xmfan/disable_duck_shape 2025-12-04T08:54:02.2928915Z * [new branch] xmfan/fca_cpp_node_passthrough -> origin/xmfan/fca_cpp_node_passthrough 2025-12-04T08:54:02.2929067Z * [new branch] xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T08:54:02.2929210Z * [new branch] xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T08:54:02.2929280Z * [new branch] xmfan/single_step -> origin/xmfan/single_step 2025-12-04T08:54:02.2929345Z * [new branch] xmfan/sth_0829 -> origin/xmfan/sth_0829 2025-12-04T08:54:02.2929405Z * [new branch] xmfan/test -> origin/xmfan/test 2025-12-04T08:54:02.2929492Z * [new branch] yguo/debug-0226-constexpr -> origin/yguo/debug-0226-constexpr 2025-12-04T08:54:02.2929568Z * [new branch] yguo/new_latest_changes -> origin/yguo/new_latest_changes 2025-12-04T08:54:02.2929662Z * [new branch] yguo/patch_constexpr_changes -> origin/yguo/patch_constexpr_changes 2025-12-04T08:54:02.2929762Z * [new branch] yiming/bootcamp -> 
origin/yiming/bootcamp 2025-12-04T08:54:02.2929862Z * [new branch] yiming/run_with_start_end_rng_hop -> origin/yiming/run_with_start_end_rng_hop 2025-12-04T08:54:02.2929926Z * [new branch] yolo-llama3 -> origin/yolo-llama3 2025-12-04T08:54:02.2930000Z * [new branch] zainr/canary-test -> origin/zainr/canary-test 2025-12-04T08:54:02.2930086Z * [new branch] zainr/cleanup-gh-runners -> origin/zainr/cleanup-gh-runners 2025-12-04T08:54:02.2930163Z * [new branch] zainr/pull-migration-c -> origin/zainr/pull-migration-c 2025-12-04T08:54:02.2930225Z * [new branch] zainr/test2 -> origin/zainr/test2 2025-12-04T08:54:02.2930296Z * [new branch] zasdfgbnm-patch-3 -> origin/zasdfgbnm-patch-3 2025-12-04T08:54:02.2930355Z * [new branch] zb2p -> origin/zb2p 2025-12-04T08:54:02.2930439Z * [new branch] zeros-and-scatter-part2 -> origin/zeros-and-scatter-part2 2025-12-04T08:54:02.2930524Z * [new branch] zhxchen17/ci/vllm_lora_oom -> origin/zhxchen17/ci/vllm_lora_oom 2025-12-04T08:54:02.2930624Z * [new branch] zhxchen17/ci/vllm_multimodal_oom -> origin/zhxchen17/ci/vllm_multimodal_oom 2025-12-04T08:54:02.2930726Z * [new branch] zhxchen17/ci/vllm_pin -> origin/zhxchen17/ci/vllm_pin 2025-12-04T08:54:02.2930851Z * [new branch] zhxchen17/dynamo/unsafe_drop_all_guards -> origin/zhxchen17/dynamo/unsafe_drop_all_guards 2025-12-04T08:54:02.2930947Z * [new branch] zhxchen17/export/call_override -> origin/zhxchen17/export/call_override 2025-12-04T08:54:02.2931035Z * [new branch] zhxchen17/export/codemod1 -> origin/zhxchen17/export/codemod1 2025-12-04T08:54:02.2931123Z * [new branch] zhxchen17/export/ctx_return -> origin/zhxchen17/export/ctx_return 2025-12-04T08:54:02.2931254Z * [new branch] zhxchen17/export/disable_side_effect_warn -> origin/zhxchen17/export/disable_side_effect_warn 2025-12-04T08:54:02.2931352Z * [new branch] zhxchen17/export/pytree_check -> origin/zhxchen17/export/pytree_check 2025-12-04T08:54:02.2931441Z * [new branch] zhxchen17/precompile/aoti -> 
origin/zhxchen17/precompile/aoti 2025-12-04T08:54:02.2931537Z * [new branch] zhxchen17/precompile/globals -> origin/zhxchen17/precompile/globals 2025-12-04T08:54:02.2931651Z * [new branch] zhxchen17/precompile/inductor_guards -> origin/zhxchen17/precompile/inductor_guards 2025-12-04T08:54:02.2931725Z * [new branch] zhxchen17/scratch/0 -> origin/zhxchen17/scratch/0 2025-12-04T08:54:02.2931830Z * [new branch] zhxchen17/torch_export_api_update -> origin/zhxchen17/torch_export_api_update 2025-12-04T08:54:02.2931907Z * [new branch] zhxhcen17/moodycamel -> origin/zhxhcen17/moodycamel 2025-12-04T08:54:02.2931980Z * [new branch] zxiiro/build-times -> origin/zxiiro/build-times 2025-12-04T08:54:02.2932052Z * [new branch] zxiiro/c7i.2xlarge -> origin/zxiiro/c7i.2xlarge 2025-12-04T08:54:02.2932130Z * [new branch] zxiiro/c7i.2xlarge.h100 -> origin/zxiiro/c7i.2xlarge.h100 2025-12-04T08:54:02.2932193Z * [new branch] zxiiro/main -> origin/zxiiro/main 2025-12-04T08:54:02.2932258Z * [new branch] zxiiro/risc64 -> origin/zxiiro/risc64 2025-12-04T08:54:02.2932347Z * [new branch] zxiiro/test-multicloud-arc -> origin/zxiiro/test-multicloud-arc 2025-12-04T08:54:02.2932407Z * [new tag] ciflow/dynamo/169525 -> ciflow/dynamo/169525 2025-12-04T08:54:02.2932477Z t [tag update] ciflow/inductor/167647 -> ciflow/inductor/167647 2025-12-04T08:54:02.2932575Z t [tag update] ciflow/inductor/168266 -> ciflow/inductor/168266 2025-12-04T08:54:02.2932640Z t [tag update] ciflow/inductor/169535 -> ciflow/inductor/169535 2025-12-04T08:54:02.2932700Z * [new tag] ciflow/trunk/165728 -> ciflow/trunk/165728 2025-12-04T08:54:02.2932759Z * [new tag] ciflow/trunk/169048 -> ciflow/trunk/169048 2025-12-04T08:54:02.2932820Z * [new tag] ciflow/trunk/169125 -> ciflow/trunk/169125 2025-12-04T08:54:02.2932878Z * [new tag] ciflow/trunk/169555 -> ciflow/trunk/169555 2025-12-04T08:54:02.2932936Z * [new tag] ciflow/xpu/169555 -> ciflow/xpu/169555 2025-12-04T08:54:02.4835884Z [command]/usr/bin/git rev-parse --verify --quiet 
ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T08:54:02.5021323Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:02.5025363Z ##[endgroup] 2025-12-04T08:54:02.5025919Z ##[group]Determining the checkout info 2025-12-04T08:54:02.5027989Z ##[endgroup] 2025-12-04T08:54:02.5035190Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T08:54:02.5147729Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T08:54:02.5181102Z ##[group]Checking out the ref 2025-12-04T08:54:02.5185331Z [command]/usr/bin/git checkout --progress --force ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:02.5481764Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T08:54:02.5486725Z ##[endgroup] 2025-12-04T08:54:02.5487273Z ##[group]Setting up auth for fetching submodules 2025-12-04T08:54:02.5491603Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T08:54:02.5516140Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-12-04T08:54:02.5531182Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T08:54:02.5560330Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T08:54:02.5575281Z ##[endgroup] 2025-12-04T08:54:02.5575786Z ##[group]Fetching submodules 2025-12-04T08:54:02.5576356Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T08:54:02.5863057Z Synchronizing submodule url for 'android/libs/fbjni' 2025-12-04T08:54:02.5874955Z Synchronizing submodule url for 'third_party/FP16' 2025-12-04T08:54:02.5888825Z Synchronizing submodule url for 'third_party/FXdiv' 2025-12-04T08:54:02.5903352Z Synchronizing submodule url for 'third_party/NNPACK' 2025-12-04T08:54:02.5924458Z Synchronizing submodule url for 'third_party/NVTX' 2025-12-04T08:54:02.5948096Z Synchronizing submodule 
url for 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:02.5970091Z Synchronizing submodule url for 'third_party/XNNPACK' 2025-12-04T08:54:02.6000688Z Synchronizing submodule url for 'third_party/aiter' 2025-12-04T08:54:02.6015387Z Synchronizing submodule url for 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:02.6034854Z Synchronizing submodule url for 'third_party/benchmark' 2025-12-04T08:54:02.6052519Z Synchronizing submodule url for 'third_party/composable_kernel' 2025-12-04T08:54:02.6068044Z Synchronizing submodule url for 'third_party/cpp-httplib' 2025-12-04T08:54:02.6083265Z Synchronizing submodule url for 'third_party/cpuinfo' 2025-12-04T08:54:02.6106082Z Synchronizing submodule url for 'third_party/cudnn_frontend' 2025-12-04T08:54:02.6119327Z Synchronizing submodule url for 'third_party/cutlass' 2025-12-04T08:54:02.6136321Z Synchronizing submodule url for 'third_party/fbgemm' 2025-12-04T08:54:02.6154824Z Synchronizing submodule url for 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:02.6166166Z Synchronizing submodule url for 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:02.6182919Z Synchronizing submodule url for 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:02.6199638Z Synchronizing submodule url for 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:02.6214219Z Synchronizing submodule url for 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:02.6225203Z Synchronizing submodule url for 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:02.6240036Z Synchronizing submodule url for 'third_party/fbgemm/external/json' 2025-12-04T08:54:02.6253858Z Synchronizing submodule url for 'third_party/flash-attention' 2025-12-04T08:54:02.6268140Z Synchronizing submodule url for 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:02.6280358Z Synchronizing submodule url for 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:02.6313488Z Synchronizing submodule url for 
'third_party/flatbuffers' 2025-12-04T08:54:02.6326158Z Synchronizing submodule url for 'third_party/fmt' 2025-12-04T08:54:02.6346637Z Synchronizing submodule url for 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:02.6367296Z Synchronizing submodule url for 'third_party/gloo' 2025-12-04T08:54:02.6384232Z Synchronizing submodule url for 'third_party/googletest' 2025-12-04T08:54:02.6407239Z Synchronizing submodule url for 'third_party/ideep' 2025-12-04T08:54:02.6418597Z Synchronizing submodule url for 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:02.6433013Z Synchronizing submodule url for 'third_party/ittapi' 2025-12-04T08:54:02.6445770Z Synchronizing submodule url for 'third_party/kineto' 2025-12-04T08:54:02.6461137Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:02.6471630Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:02.6483246Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:02.6492802Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:02.6501895Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:02.6522579Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:02.6532942Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:02.6554333Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:02.6572324Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:02.6583356Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 
2025-12-04T08:54:02.6602780Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:02.6614764Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:02.6637914Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:02.6653674Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:02.6664415Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:02.6690851Z Synchronizing submodule url for 'third_party/kleidiai' 2025-12-04T08:54:02.6708216Z Synchronizing submodule url for 'third_party/mimalloc' 2025-12-04T08:54:02.6720131Z Synchronizing submodule url for 'third_party/nlohmann' 2025-12-04T08:54:02.6731785Z Synchronizing submodule url for 'third_party/onnx' 2025-12-04T08:54:02.6768911Z Synchronizing submodule url for 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:02.6783187Z Synchronizing submodule url for 'third_party/opentelemetry-cpp' 2025-12-04T08:54:02.6804780Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:02.6826057Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:02.6853250Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:02.6864177Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:02.6885916Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:02.6896731Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:02.6907362Z Synchronizing submodule url for 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:02.6924891Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:02.6936699Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:02.6948719Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:02.6968476Z Synchronizing submodule url for 'third_party/pocketfft' 2025-12-04T08:54:02.6980124Z Synchronizing submodule url for 'third_party/protobuf' 2025-12-04T08:54:02.6992439Z Synchronizing submodule url for 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:02.7013624Z Synchronizing submodule url for 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:02.7027963Z Synchronizing submodule url for 'third_party/psimd' 2025-12-04T08:54:02.7038813Z Synchronizing submodule url for 'third_party/pthreadpool' 2025-12-04T08:54:02.7049655Z Synchronizing submodule url for 'third_party/pybind11' 2025-12-04T08:54:02.7070729Z Synchronizing submodule url for 'third_party/python-peachpy' 2025-12-04T08:54:02.7082520Z Synchronizing submodule url for 'third_party/sleef' 2025-12-04T08:54:02.7092854Z Synchronizing submodule url for 'third_party/tensorpipe' 2025-12-04T08:54:02.7106767Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:02.7115375Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:02.7125700Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:02.7142901Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:02.7169707Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:02.7217074Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 
2025-12-04T08:54:02.7508367Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T08:54:02.7588303Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T08:54:02.7645743Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T08:54:02.7778433Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T08:54:02.7867068Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 2025-12-04T08:54:02.7934118Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T08:54:03.2855053Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T08:54:03.3076559Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T08:54:03.3318260Z Submodule path 'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T08:54:03.3447791Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T08:54:03.3676353Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T08:54:03.3779024Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T08:54:03.4438916Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T08:54:03.4569025Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T08:54:03.4734398Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 2025-12-04T08:54:03.5499998Z Submodule path 'third_party/fbgemm': checked out 
'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T08:54:03.5865447Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T08:54:03.7750836Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T08:54:03.8488260Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T08:54:04.2924679Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T08:54:04.3186218Z Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:04.3340461Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T08:54:04.3957243Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T08:54:04.4115484Z Submodule path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T08:54:04.4295435Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T08:54:04.4487688Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T08:54:04.4625303Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T08:54:04.4798686Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T08:54:04.5044199Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T08:54:04.5210025Z Submodule path 'third_party/gloo': checked out '54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T08:54:04.5412656Z Submodule path 
'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:04.5525575Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T08:54:04.9599480Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T08:54:04.9736459Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T08:54:04.9828569Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T08:54:04.9940634Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T08:54:05.0051423Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T08:54:05.0126835Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2025-12-04T08:54:05.0229366Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T08:54:05.0317842Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T08:54:05.0427402Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T08:54:05.0523019Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T08:54:05.0612731Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:05.0713995Z Submodule path 
'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T08:54:05.0779408Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T08:54:05.0865690Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T08:54:05.0965657Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T08:54:05.1084022Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T08:54:05.1164922Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T08:54:05.1273057Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:05.1379316Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T08:54:05.1517151Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T08:54:05.1648722Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T08:54:05.3358186Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T08:54:05.3599260Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T08:54:05.3752327Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T08:54:05.3860371Z Submodule path 
'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T08:54:05.3970733Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T08:54:05.4047944Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T08:54:05.4170513Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T08:54:05.4239261Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T08:54:05.4348864Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T08:54:05.4430422Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T08:54:05.4526449Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T08:54:05.4605199Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T08:54:05.4777687Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T08:54:05.4869214Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T08:54:05.6196234Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T08:54:05.6327993Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T08:54:05.6566034Z Submodule path 
'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T08:54:05.6677022Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T08:54:05.6776810Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T08:54:05.6995184Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T08:54:05.7259846Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2025-12-04T08:54:05.7554393Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T08:54:05.7710213Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T08:54:05.7933099Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T08:54:05.8039282Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T08:54:05.8328654Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T08:54:05.8488929Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T08:54:05.8583515Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T08:54:05.8635601Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T08:54:05.8885540Z Entering 'android/libs/fbjni' 2025-12-04T08:54:05.8918978Z Entering 'third_party/FP16' 2025-12-04T08:54:05.8944824Z Entering 'third_party/FXdiv' 2025-12-04T08:54:05.8977298Z Entering 'third_party/NNPACK' 2025-12-04T08:54:05.9009807Z Entering 'third_party/NVTX' 
2025-12-04T08:54:05.9038457Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:05.9072278Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:05.9139434Z Entering 'third_party/aiter' 2025-12-04T08:54:05.9186292Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:05.9233706Z Entering 'third_party/benchmark' 2025-12-04T08:54:05.9255171Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:05.9278888Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:05.9321918Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:05.9355018Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:05.9386956Z Entering 'third_party/cutlass' 2025-12-04T08:54:05.9429981Z Entering 'third_party/fbgemm' 2025-12-04T08:54:05.9465609Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:05.9497682Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:05.9523214Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:05.9549858Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:05.9601317Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:05.9623725Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:05.9644318Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:05.9694158Z Entering 'third_party/flash-attention' 2025-12-04T08:54:05.9717284Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:05.9741261Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:05.9773118Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:05.9827825Z Entering 'third_party/fmt' 2025-12-04T08:54:05.9851356Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:05.9888158Z Entering 'third_party/gloo' 2025-12-04T08:54:05.9926945Z Entering 'third_party/googletest' 2025-12-04T08:54:05.9949771Z Entering 'third_party/ideep' 2025-12-04T08:54:05.9982190Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:06.0020155Z Entering 'third_party/ittapi' 
2025-12-04T08:54:06.0053088Z Entering 'third_party/kineto' 2025-12-04T08:54:06.0075904Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:06.0106597Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:06.0138748Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:06.0175503Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:06.0195413Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:06.0227099Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:06.0261777Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:06.0288199Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:06.0308057Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:06.0336587Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:06.0383775Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:06.0404248Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:06.0448381Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:06.0476232Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:06.0498188Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:06.0535019Z Entering 'third_party/kleidiai' 2025-12-04T08:54:06.0559667Z Entering 'third_party/mimalloc' 2025-12-04T08:54:06.0583030Z Entering 'third_party/nlohmann' 2025-12-04T08:54:06.0617014Z Entering 'third_party/onnx' 2025-12-04T08:54:06.0655595Z Entering 'third_party/onnx/third_party/pybind11' 
2025-12-04T08:54:06.0678898Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:06.0706644Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:06.0728597Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:06.0756474Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:06.0775596Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:06.0797178Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:06.0816506Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:06.0835140Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:06.0854692Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:06.0875153Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:06.0903053Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:06.0932430Z Entering 'third_party/pocketfft' 2025-12-04T08:54:06.0953794Z Entering 'third_party/protobuf' 2025-12-04T08:54:06.0990454Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:06.1027950Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:06.1060741Z Entering 'third_party/psimd' 2025-12-04T08:54:06.1087948Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:06.1111216Z Entering 'third_party/pybind11' 2025-12-04T08:54:06.1156085Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:06.1178043Z Entering 'third_party/sleef' 2025-12-04T08:54:06.1206555Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:06.1241716Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:06.1279262Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:06.1322311Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:06.1363647Z Entering 
'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:06.1389003Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:06.1430499Z ##[endgroup] 2025-12-04T08:54:06.1431095Z ##[group]Persisting credentials for submodules 2025-12-04T08:54:06.1443592Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T08:54:06.1660477Z Entering 'android/libs/fbjni' 2025-12-04T08:54:06.1689830Z Entering 'third_party/FP16' 2025-12-04T08:54:06.1726839Z Entering 'third_party/FXdiv' 2025-12-04T08:54:06.1753476Z Entering 'third_party/NNPACK' 2025-12-04T08:54:06.1795229Z Entering 'third_party/NVTX' 2025-12-04T08:54:06.1820704Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:06.1843142Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:06.1871089Z Entering 'third_party/aiter' 2025-12-04T08:54:06.1901851Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:06.1952146Z Entering 'third_party/benchmark' 2025-12-04T08:54:06.1988617Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:06.2017626Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:06.2052737Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:06.2088515Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:06.2117183Z Entering 'third_party/cutlass' 2025-12-04T08:54:06.2159988Z Entering 'third_party/fbgemm' 2025-12-04T08:54:06.2196356Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:06.2233272Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:06.2270085Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:06.2305481Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:06.2341543Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:06.2363222Z Entering 'third_party/fbgemm/external/hipify_torch' 
2025-12-04T08:54:06.2406802Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:06.2441690Z Entering 'third_party/flash-attention' 2025-12-04T08:54:06.2490483Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:06.2523532Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:06.2569605Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:06.2600974Z Entering 'third_party/fmt' 2025-12-04T08:54:06.2625216Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:06.2650378Z Entering 'third_party/gloo' 2025-12-04T08:54:06.2675880Z Entering 'third_party/googletest' 2025-12-04T08:54:06.2701529Z Entering 'third_party/ideep' 2025-12-04T08:54:06.2730494Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:06.2768853Z Entering 'third_party/ittapi' 2025-12-04T08:54:06.2823246Z Entering 'third_party/kineto' 2025-12-04T08:54:06.2850286Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:06.2877936Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:06.2906062Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:06.2934170Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:06.2957385Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:06.2984363Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:06.3008288Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:06.3053123Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:06.3104089Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:06.3139298Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:06.3161122Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:06.3186694Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:06.3225454Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:06.3253630Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:06.3289158Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:06.3315437Z Entering 'third_party/kleidiai' 2025-12-04T08:54:06.3347546Z Entering 'third_party/mimalloc' 2025-12-04T08:54:06.3386236Z Entering 'third_party/nlohmann' 2025-12-04T08:54:06.3418389Z Entering 'third_party/onnx' 2025-12-04T08:54:06.3456851Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:06.3491469Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:06.3519530Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:06.3543341Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:06.3598352Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:06.3651170Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:06.3696321Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:06.3743142Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:06.3793767Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:06.3837148Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:06.3894846Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:06.3930929Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:06.3985059Z Entering 'third_party/pocketfft' 2025-12-04T08:54:06.4034242Z Entering 
'third_party/protobuf' 2025-12-04T08:54:06.4089211Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:06.4123804Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:06.4155380Z Entering 'third_party/psimd' 2025-12-04T08:54:06.4191772Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:06.4228499Z Entering 'third_party/pybind11' 2025-12-04T08:54:06.4265373Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:06.4292901Z Entering 'third_party/sleef' 2025-12-04T08:54:06.4319941Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:06.4343309Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:06.4381759Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:06.4410799Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:06.4434045Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:06.4459959Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:06.4510816Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T08:54:06.4737619Z Entering 'android/libs/fbjni' 2025-12-04T08:54:06.4761665Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T08:54:06.4782310Z Entering 'third_party/FP16' 2025-12-04T08:54:06.4822245Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T08:54:06.4832964Z Entering 'third_party/FXdiv' 2025-12-04T08:54:06.4857683Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T08:54:06.4877449Z Entering 'third_party/NNPACK' 2025-12-04T08:54:06.4905848Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config 
remote.origin.url 2025-12-04T08:54:06.4924475Z Entering 'third_party/NVTX' 2025-12-04T08:54:06.4956370Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T08:54:06.4969945Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:06.5007131Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T08:54:06.5017769Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:06.5050019Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T08:54:06.5066892Z Entering 'third_party/aiter' 2025-12-04T08:54:06.5090046Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T08:54:06.5100498Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:06.5130857Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T08:54:06.5149874Z Entering 'third_party/benchmark' 2025-12-04T08:54:06.5170658Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:06.5182642Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:06.5211560Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T08:54:06.5226900Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:06.5248779Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T08:54:06.5260097Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:06.5281441Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T08:54:06.5300916Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:06.5323402Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config 
remote.origin.url 2025-12-04T08:54:06.5334753Z Entering 'third_party/cutlass' 2025-12-04T08:54:06.5356501Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T08:54:06.5371387Z Entering 'third_party/fbgemm' 2025-12-04T08:54:06.5391905Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T08:54:06.5402566Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:06.5443789Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T08:54:06.5463831Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:06.5485418Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T08:54:06.5498643Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:06.5537123Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T08:54:06.5547810Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:06.5567888Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T08:54:06.5593889Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:06.5614975Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T08:54:06.5626297Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:06.5651343Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T08:54:06.5661768Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:06.5693115Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config 
remote.origin.url 2025-12-04T08:54:06.5715434Z Entering 'third_party/flash-attention' 2025-12-04T08:54:06.5743329Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T08:54:06.5763720Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:06.5789076Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T08:54:06.5815730Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:06.5847167Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T08:54:06.5862699Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:06.5891714Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T08:54:06.5903652Z Entering 'third_party/fmt' 2025-12-04T08:54:06.5924450Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:54:06.5935093Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:06.5956538Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T08:54:06.5969983Z Entering 'third_party/gloo' 2025-12-04T08:54:06.5995319Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T08:54:06.6005417Z Entering 'third_party/googletest' 2025-12-04T08:54:06.6027448Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:06.6038693Z Entering 'third_party/ideep' 2025-12-04T08:54:06.6069891Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T08:54:06.6081806Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:06.6107759Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T08:54:06.6121407Z Entering 'third_party/ittapi' 2025-12-04T08:54:06.6163125Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T08:54:06.6183753Z Entering 'third_party/kineto' 2025-12-04T08:54:06.6210372Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T08:54:06.6221505Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:06.6251115Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T08:54:06.6271883Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:06.6306395Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T08:54:06.6318061Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:06.6340396Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T08:54:06.6350673Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:06.6381938Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:54:06.6392059Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:06.6413507Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T08:54:06.6422595Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:06.6446174Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T08:54:06.6468760Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:06.6508339Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T08:54:06.6519527Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:06.6553347Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:06.6567949Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:06.6591584Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T08:54:06.6603304Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:06.6630350Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T08:54:06.6640850Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:06.6661206Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:54:06.6671106Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:06.6715524Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:54:06.6724318Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:06.6746333Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:54:06.6759942Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:06.6794763Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T08:54:06.6805440Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:06.6840329Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T08:54:06.6854083Z Entering 'third_party/kleidiai' 2025-12-04T08:54:06.6876358Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T08:54:06.6887182Z Entering 'third_party/mimalloc' 2025-12-04T08:54:06.6912294Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T08:54:06.6923007Z Entering 'third_party/nlohmann' 2025-12-04T08:54:06.6954774Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T08:54:06.6966476Z Entering 'third_party/onnx' 2025-12-04T08:54:06.6991227Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T08:54:06.7025590Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:06.7046359Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:06.7059940Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:06.7098880Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T08:54:06.7109722Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:06.7132983Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:06.7146449Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:06.7175167Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:06.7184359Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:06.7205518Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T08:54:06.7215828Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:06.7237604Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T08:54:06.7248944Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:06.7280452Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T08:54:06.7290570Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:06.7309103Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T08:54:06.7329272Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:06.7364268Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:54:06.7373968Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:06.7399868Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:54:06.7410490Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:06.7438837Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:54:06.7458183Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:06.7477036Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T08:54:06.7495298Z Entering 'third_party/pocketfft' 2025-12-04T08:54:06.7521741Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T08:54:06.7532907Z Entering 'third_party/protobuf' 2025-12-04T08:54:06.7570782Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T08:54:06.7583237Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:06.7604420Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:06.7613092Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:06.7646356Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:06.7658190Z Entering 
'third_party/psimd' 2025-12-04T08:54:06.7679038Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T08:54:06.7699640Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:06.7731300Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T08:54:06.7743197Z Entering 'third_party/pybind11' 2025-12-04T08:54:06.7779063Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:06.7791279Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:06.7818223Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T08:54:06.7829464Z Entering 'third_party/sleef' 2025-12-04T08:54:06.7859660Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T08:54:06.7871413Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:06.7902706Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T08:54:06.7914565Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:06.7939088Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:06.7958951Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:06.7989477Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T08:54:06.8000090Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:06.8028120Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T08:54:06.8038953Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:06.8084088Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:06.8093775Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:06.8140137Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T08:54:06.8390462Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T08:54:06.8635590Z Entering 'android/libs/fbjni' 2025-12-04T08:54:06.8675521Z Entering 'third_party/FP16' 2025-12-04T08:54:06.8710003Z Entering 'third_party/FXdiv' 2025-12-04T08:54:06.8754412Z Entering 'third_party/NNPACK' 2025-12-04T08:54:06.8775337Z Entering 'third_party/NVTX' 2025-12-04T08:54:06.8801584Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:06.8833485Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:06.8880022Z Entering 'third_party/aiter' 2025-12-04T08:54:06.8901852Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:06.8926205Z Entering 'third_party/benchmark' 2025-12-04T08:54:06.8958412Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:06.8989340Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:06.9009900Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:06.9056068Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:06.9090059Z Entering 'third_party/cutlass' 2025-12-04T08:54:06.9133808Z Entering 'third_party/fbgemm' 2025-12-04T08:54:06.9160990Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:06.9214178Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:06.9245058Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:06.9263858Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:06.9296226Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:06.9326938Z 
Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:06.9345436Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:06.9366854Z Entering 'third_party/flash-attention' 2025-12-04T08:54:06.9395070Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:06.9421735Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:06.9447787Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:06.9470304Z Entering 'third_party/fmt' 2025-12-04T08:54:06.9490510Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:06.9532824Z Entering 'third_party/gloo' 2025-12-04T08:54:06.9554539Z Entering 'third_party/googletest' 2025-12-04T08:54:06.9575258Z Entering 'third_party/ideep' 2025-12-04T08:54:06.9596429Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:06.9620314Z Entering 'third_party/ittapi' 2025-12-04T08:54:06.9652792Z Entering 'third_party/kineto' 2025-12-04T08:54:06.9675252Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:06.9706996Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:06.9739748Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:06.9769412Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:06.9814530Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:06.9865439Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:06.9906417Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:06.9940170Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:06.9984111Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:07.0021552Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:07.0062763Z 
Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:07.0091985Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:07.0136576Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:07.0171590Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:07.0190745Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:07.0212804Z Entering 'third_party/kleidiai' 2025-12-04T08:54:07.0243797Z Entering 'third_party/mimalloc' 2025-12-04T08:54:07.0277700Z Entering 'third_party/nlohmann' 2025-12-04T08:54:07.0302114Z Entering 'third_party/onnx' 2025-12-04T08:54:07.0340120Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:07.0373155Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:07.0403540Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:07.0443331Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:07.0488199Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:07.0509226Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:07.0531705Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:07.0570900Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:07.0602109Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:07.0632511Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:07.0677457Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:07.0704607Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:07.0748508Z Entering 'third_party/pocketfft' 2025-12-04T08:54:07.0770065Z 
Entering 'third_party/protobuf' 2025-12-04T08:54:07.0803702Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:07.0839717Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:07.0875067Z Entering 'third_party/psimd' 2025-12-04T08:54:07.0914854Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:07.0937044Z Entering 'third_party/pybind11' 2025-12-04T08:54:07.0972699Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:07.1002566Z Entering 'third_party/sleef' 2025-12-04T08:54:07.1023424Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:07.1043961Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:07.1066018Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:07.1098939Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:07.1151398Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:07.1181104Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:07.1226751Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-12-04T08:54:07.1501088Z Entering 'android/libs/fbjni' 2025-12-04T08:54:07.1535780Z Entering 'third_party/FP16' 2025-12-04T08:54:07.1575117Z Entering 'third_party/FXdiv' 2025-12-04T08:54:07.1609585Z Entering 'third_party/NNPACK' 2025-12-04T08:54:07.1647430Z Entering 'third_party/NVTX' 2025-12-04T08:54:07.1681802Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:07.1714382Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:07.1754990Z Entering 'third_party/aiter' 2025-12-04T08:54:07.1778098Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:07.1818558Z Entering 'third_party/benchmark' 2025-12-04T08:54:07.1849736Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:07.1877762Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:07.1902312Z Entering 'third_party/cpuinfo' 
2025-12-04T08:54:07.1934733Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:07.1979449Z Entering 'third_party/cutlass' 2025-12-04T08:54:07.2037978Z Entering 'third_party/fbgemm' 2025-12-04T08:54:07.2070267Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:07.2102968Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:07.2156928Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:07.2194572Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:07.2231994Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:07.2258961Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:07.2286698Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:07.2307861Z Entering 'third_party/flash-attention' 2025-12-04T08:54:07.2333987Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:07.2356614Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:07.2375635Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:07.2391881Z Entering 'third_party/fmt' 2025-12-04T08:54:07.2407064Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:07.2422045Z Entering 'third_party/gloo' 2025-12-04T08:54:07.2437115Z Entering 'third_party/googletest' 2025-12-04T08:54:07.2451768Z Entering 'third_party/ideep' 2025-12-04T08:54:07.2478103Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:07.2496969Z Entering 'third_party/ittapi' 2025-12-04T08:54:07.2512794Z Entering 'third_party/kineto' 2025-12-04T08:54:07.2527897Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:07.2563316Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:07.2583414Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:07.2601978Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:07.2625327Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:07.2642944Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:07.2662602Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:07.2680411Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:07.2698798Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:07.2717174Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:07.2734885Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:07.2753182Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:07.2770118Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:07.2792276Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:07.2808030Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:07.2827202Z Entering 'third_party/kleidiai' 2025-12-04T08:54:07.2865378Z Entering 'third_party/mimalloc' 2025-12-04T08:54:07.2882852Z Entering 'third_party/nlohmann' 2025-12-04T08:54:07.2900644Z Entering 'third_party/onnx' 2025-12-04T08:54:07.2925206Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:07.2963181Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:07.3004572Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:07.3029560Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:07.3047236Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:07.3074456Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:07.3092241Z Entering 
'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:07.3109700Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:07.3127280Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:07.3143983Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:07.3174307Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:07.3207622Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:07.3255179Z Entering 'third_party/pocketfft' 2025-12-04T08:54:07.3294693Z Entering 'third_party/protobuf' 2025-12-04T08:54:07.3340661Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:07.3364264Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:07.3401522Z Entering 'third_party/psimd' 2025-12-04T08:54:07.3431383Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:07.3453722Z Entering 'third_party/pybind11' 2025-12-04T08:54:07.3476325Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:07.3496735Z Entering 'third_party/sleef' 2025-12-04T08:54:07.3517196Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:07.3536372Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:07.3557637Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:07.3591553Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:07.3610808Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:07.3640416Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:07.3686920Z ##[endgroup] 2025-12-04T08:54:07.3917914Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T08:54:07.4050028Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:07.4254910Z ##[group]Run actions/checkout@v4 2025-12-04T08:54:07.4255275Z with: 2025-12-04T08:54:07.4255619Z ref: 
ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:07.4256233Z fetch-depth: 0 2025-12-04T08:54:07.4256527Z submodules: recursive 2025-12-04T08:54:07.4256841Z show-progress: false 2025-12-04T08:54:07.4257183Z repository: pytorch/pytorch 2025-12-04T08:54:07.4257649Z token: *** 2025-12-04T08:54:07.4257922Z ssh-strict: true 2025-12-04T08:54:07.4258198Z ssh-user: git 2025-12-04T08:54:07.4258493Z persist-credentials: true 2025-12-04T08:54:07.4258826Z clean: true 2025-12-04T08:54:07.4259129Z sparse-checkout-cone-mode: true 2025-12-04T08:54:07.4259509Z fetch-tags: false 2025-12-04T08:54:07.4259780Z lfs: false 2025-12-04T08:54:07.4260065Z set-safe-directory: true 2025-12-04T08:54:07.4260375Z env: 2025-12-04T08:54:07.4260640Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:07.4260952Z ##[endgroup] 2025-12-04T08:54:07.4740661Z Syncing repository: pytorch/pytorch 2025-12-04T08:54:07.4741297Z ##[group]Getting Git version info 2025-12-04T08:54:07.4741787Z Working directory is '/home/runner/_work/pytorch/pytorch' 2025-12-04T08:54:07.4754590Z [command]/usr/bin/git version 2025-12-04T08:54:07.4784058Z git version 2.52.0 2025-12-04T08:54:07.4798008Z ##[endgroup] 2025-12-04T08:54:07.4803293Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/86ab5b77-1f75-4b54-a69c-6d4c09aca442/.gitconfig' 2025-12-04T08:54:07.4809107Z Temporarily overriding HOME='/home/runner/_work/_temp/86ab5b77-1f75-4b54-a69c-6d4c09aca442' before making global git config changes 2025-12-04T08:54:07.4810092Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T08:54:07.4811533Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch 2025-12-04T08:54:07.4834553Z [command]/usr/bin/git config --local --get remote.origin.url 2025-12-04T08:54:07.4849977Z https://github.com/pytorch/pytorch 2025-12-04T08:54:07.4861696Z ##[group]Removing previously created refs, to avoid conflicts 2025-12-04T08:54:07.4864262Z [command]/usr/bin/git 
rev-parse --symbolic-full-name --verify --quiet HEAD 2025-12-04T08:54:07.4879251Z HEAD 2025-12-04T08:54:07.4917650Z ##[endgroup] 2025-12-04T08:54:07.4918529Z [command]/usr/bin/git submodule status 2025-12-04T08:54:07.5217050Z 7e1e1fe3858c63c251c637ae41a20de425dde96f android/libs/fbjni (v0.1.0-12-g7e1e1fe) 2025-12-04T08:54:07.5298469Z 4dfe081cf6bcd15db339cf2680b9281b8451eeb3 third_party/FP16 (4dfe081) 2025-12-04T08:54:07.5378736Z b408327ac2a15ec3e43352421954f5b1967701d1 third_party/FXdiv (b408327) 2025-12-04T08:54:07.5479452Z c07e3a0400713d546e0dea2d5466dd22ea389c73 third_party/NNPACK (c07e3a0) 2025-12-04T08:54:07.5511081Z 3ebbc93ded7285963bff932c678fa367eb393ba6 third_party/NVTX (v3.1.0-313-g3ebbc93) 2025-12-04T08:54:07.5588206Z 1d8f600fd424278486eade7ed3e877c99f0846b1 third_party/VulkanMemoryAllocator (v2.1.0-982-g1d8f600) 2025-12-04T08:54:07.5890035Z 51a0103656eff6fc9bfd39a4597923c4b542c883 third_party/XNNPACK (remotes/origin/ds/ndk-1243-g51a0103656) 2025-12-04T08:54:07.5947624Z 01aae101b9e5e94d6c16a9514c9fb8df99c93150 third_party/aiter (v0.1.1-92-g01aae101) 2025-12-04T08:54:07.5985342Z 299e5928955cc62af9968370293b916f5130916f third_party/benchmark (v1.9.3) 2025-12-04T08:54:07.6059551Z 7fe50dc3da2069d6645d9deb8c017a876472a977 third_party/composable_kernel (rocm-6.4.3-459-g7fe50dc3d) 2025-12-04T08:54:07.6150931Z 89c932f313c6437c38f2982869beacc89c2f2246 third_party/cpp-httplib (v0.26.0) 2025-12-04T08:54:07.6282103Z f858c30bcb16f8effd5ff46996f0514539e17abc third_party/cpuinfo (f858c30) 2025-12-04T08:54:07.6334244Z 0b1577c8c83401237d601d0d0db5210506705396 third_party/cudnn_frontend (v0.5-61-g0b1577c) 2025-12-04T08:54:07.6413352Z f88806b1e31dfa579842638740216dd41fc6c588 third_party/cutlass (v4.3.1) 2025-12-04T08:54:07.6442902Z c0b988d39a9e47c794d699f29930ed4d7c7e13a4 third_party/fbgemm (v1.4.0-rc1-2-gc0b988d39) 2025-12-04T08:54:07.6514678Z 979702c87a8713a8e0a5e9fee122b90d2ef13be5 third_party/flash-attention (v2.7.4) 2025-12-04T08:54:07.6547587Z 
a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757 third_party/flatbuffers (v24.12.23) 2025-12-04T08:54:07.6809786Z 407c905e45ad75fc29bf0f9bb7c5c2fd3475976f third_party/fmt (12.1.0) 2025-12-04T08:54:07.6927190Z 3fb5c176c17c765a3492cd2f0321b0dab712f350 third_party/gemmlowp/gemmlowp (remotes/origin/revert-87-master-135-g3fb5c17) 2025-12-04T08:54:07.7022121Z 54cbae0d3a67fa890b4c3d9ee162b7860315e341 third_party/gloo (remotes/origin/gh/c-p-i-o/1/base-37-g54cbae0) 2025-12-04T08:54:07.7165052Z 52eb8108c5bdec04579160ae17225d66034bd723 third_party/googletest (release-1.8.0-3544-g52eb8108) 2025-12-04T08:54:07.7258593Z 719d8e6cd7f7a0e01b155657526d693acf97c2b3 third_party/ideep (pytorch-rls-v3.7.1) 2025-12-04T08:54:07.7336302Z dec1d23ca65ab069d225dfe40dea14f455170959 third_party/ittapi (v3.25.5) 2025-12-04T08:54:07.7537686Z 31f85df8fbd89c188f14ef10f1ec65379786b943 third_party/kineto (heads/main) 2025-12-04T08:54:07.7575670Z d7770c89632329a9914ef1a90289917597639cbe third_party/kleidiai (v1.15.0) 2025-12-04T08:54:07.7611074Z fbd8b99c2b828428947d70fdc046bb55609be93e third_party/mimalloc (v2.2.4) 2025-12-04T08:54:07.7645224Z 55f93686c01528224f448c19128836e7df245f72 third_party/nlohmann (v3.12.0) 2025-12-04T08:54:07.7882329Z e709452ef2bbc1d113faf678c24e6d3467696e83 third_party/onnx (v1.18.0) 2025-12-04T08:54:07.7920179Z a799f4aed9c94b765dcdaabaeab7d5e7e2310878 third_party/opentelemetry-cpp (v1.14.2) 2025-12-04T08:54:07.7944495Z 0fa0ef591e38c2758e3184c6c23e497b9f732ffa third_party/pocketfft (release_for_eigen-40-g0fa0ef5) 2025-12-04T08:54:07.8177041Z d1eca4e4b421cd2997495c4b4e65cea6be4e9b8a third_party/protobuf (v3.7.0-rc.2-1279-gd1eca4e4b) 2025-12-04T08:54:07.8268271Z 072586a71b55b7f8c584153d223e95687148a900 third_party/psimd (heads/master) 2025-12-04T08:54:07.8354134Z 4fe0e1e183925bf8cfa6aae24237e724a96479b8 third_party/pthreadpool (0.1-144-g4fe0e1e) 2025-12-04T08:54:07.8388114Z f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8 third_party/pybind11 (v3.0.1) 2025-12-04T08:54:07.8475448Z 
f45429b087dd7d5bc78bb40dc7cf06425c252d67 third_party/python-peachpy (remotes/origin/pre-generated) 2025-12-04T08:54:07.8531948Z 5a1d179df9cf652951b59010a2d2075372d67f68 third_party/sleef (3.8) 2025-12-04T08:54:07.8625496Z 2b4cd91092d335a697416b2a3cb398283246849d third_party/tensorpipe (heads/main) 2025-12-04T08:54:07.8641455Z ##[group]Cleaning the repository 2025-12-04T08:54:07.8646533Z [command]/usr/bin/git clean -ffdx 2025-12-04T08:54:07.8786374Z [command]/usr/bin/git reset --hard HEAD 2025-12-04T08:54:07.9643157Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T08:54:07.9736136Z ##[endgroup] 2025-12-04T08:54:07.9742165Z ##[group]Disabling automatic garbage collection 2025-12-04T08:54:07.9752208Z [command]/usr/bin/git config --local gc.auto 0 2025-12-04T08:54:07.9794780Z ##[endgroup] 2025-12-04T08:54:07.9795267Z ##[group]Setting up auth 2025-12-04T08:54:07.9806188Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T08:54:07.9852879Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T08:54:08.0102764Z Entering 'android/libs/fbjni' 2025-12-04T08:54:08.0133928Z Entering 'third_party/FP16' 2025-12-04T08:54:08.0180713Z Entering 'third_party/FXdiv' 2025-12-04T08:54:08.0219436Z Entering 'third_party/NNPACK' 2025-12-04T08:54:08.0263195Z Entering 'third_party/NVTX' 2025-12-04T08:54:08.0289406Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:08.0321708Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:08.0357408Z Entering 'third_party/aiter' 2025-12-04T08:54:08.0401456Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:08.0441282Z Entering 'third_party/benchmark' 2025-12-04T08:54:08.0475130Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:08.0502089Z Entering 'third_party/cpp-httplib' 
2025-12-04T08:54:08.0536909Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:08.0585720Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:08.0611151Z Entering 'third_party/cutlass' 2025-12-04T08:54:08.0649197Z Entering 'third_party/fbgemm' 2025-12-04T08:54:08.0692212Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:08.0726525Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:08.0753510Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:08.0780919Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:08.0804562Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:08.0827205Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:08.0874816Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:08.0922100Z Entering 'third_party/flash-attention' 2025-12-04T08:54:08.0955485Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:08.0987548Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:08.1020712Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:08.1047295Z Entering 'third_party/fmt' 2025-12-04T08:54:08.1071528Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:08.1105595Z Entering 'third_party/gloo' 2025-12-04T08:54:08.1140949Z Entering 'third_party/googletest' 2025-12-04T08:54:08.1167433Z Entering 'third_party/ideep' 2025-12-04T08:54:08.1203930Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:08.1239116Z Entering 'third_party/ittapi' 2025-12-04T08:54:08.1264197Z Entering 'third_party/kineto' 2025-12-04T08:54:08.1301039Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:08.1344475Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:08.1373546Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:08.1399912Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 
2025-12-04T08:54:08.1420296Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:08.1447154Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:08.1477067Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:08.1499202Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:08.1522850Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:08.1543399Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:08.1565336Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:08.1605999Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:08.1629085Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:08.1655067Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:08.1686639Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:08.1711578Z Entering 'third_party/kleidiai' 2025-12-04T08:54:08.1741749Z Entering 'third_party/mimalloc' 2025-12-04T08:54:08.1779312Z Entering 'third_party/nlohmann' 2025-12-04T08:54:08.1824038Z Entering 'third_party/onnx' 2025-12-04T08:54:08.1856787Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:08.1894564Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:08.1931279Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:08.1966930Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:08.1991688Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:08.2015616Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:08.2051836Z 
Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:08.2081572Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:08.2106153Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:08.2135419Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:08.2173867Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:08.2195511Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:08.2233477Z Entering 'third_party/pocketfft' 2025-12-04T08:54:08.2268341Z Entering 'third_party/protobuf' 2025-12-04T08:54:08.2301779Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:08.2334144Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:08.2370407Z Entering 'third_party/psimd' 2025-12-04T08:54:08.2395516Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:08.2417978Z Entering 'third_party/pybind11' 2025-12-04T08:54:08.2442247Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:08.2468976Z Entering 'third_party/sleef' 2025-12-04T08:54:08.2495247Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:08.2524000Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:08.2556133Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:08.2584289Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:08.2607308Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:08.2635640Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:08.2690225Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T08:54:08.2709296Z http.https://github.com/.extraheader 2025-12-04T08:54:08.2724686Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 
2025-12-04T08:54:08.2758980Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T08:54:08.2982682Z Entering 'android/libs/fbjni' 2025-12-04T08:54:08.3019461Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3051092Z Entering 'third_party/FP16' 2025-12-04T08:54:08.3082666Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3116027Z Entering 'third_party/FXdiv' 2025-12-04T08:54:08.3140777Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3181647Z Entering 'third_party/NNPACK' 2025-12-04T08:54:08.3207616Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3232209Z Entering 'third_party/NVTX' 2025-12-04T08:54:08.3252607Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3285065Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:08.3303040Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3327691Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:08.3351635Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3380442Z Entering 'third_party/aiter' 2025-12-04T08:54:08.3401691Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3425087Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:08.3439517Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3478609Z Entering 'third_party/benchmark' 2025-12-04T08:54:08.3505657Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3538875Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:08.3560834Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3599955Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:08.3626372Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3647910Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:08.3672396Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3693096Z Entering 
'third_party/cudnn_frontend' 2025-12-04T08:54:08.3706511Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3726459Z Entering 'third_party/cutlass' 2025-12-04T08:54:08.3745853Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3782494Z Entering 'third_party/fbgemm' 2025-12-04T08:54:08.3797913Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3832946Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:08.3848891Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3866510Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:08.3879967Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3926081Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:08.3941982Z http.https://github.com/.extraheader 2025-12-04T08:54:08.3958450Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:08.3978095Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4011989Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:08.4025381Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4053503Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:08.4066997Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4086108Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:08.4099247Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4120083Z Entering 'third_party/flash-attention' 2025-12-04T08:54:08.4133726Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4162407Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:08.4183303Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4204820Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:08.4224824Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4257185Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:08.4286067Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4306321Z Entering 'third_party/fmt' 
2025-12-04T08:54:08.4331746Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4348249Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:08.4362944Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4381699Z Entering 'third_party/gloo' 2025-12-04T08:54:08.4400862Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4442020Z Entering 'third_party/googletest' 2025-12-04T08:54:08.4461229Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4481858Z Entering 'third_party/ideep' 2025-12-04T08:54:08.4504103Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4532167Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:08.4550445Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4596056Z Entering 'third_party/ittapi' 2025-12-04T08:54:08.4623811Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4657117Z Entering 'third_party/kineto' 2025-12-04T08:54:08.4677187Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4698054Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:08.4711740Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4730098Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:08.4743858Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4774397Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:08.4787723Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4806603Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:08.4820852Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4837585Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:08.4856091Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4884031Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:08.4899385Z http.https://github.com/.extraheader 
2025-12-04T08:54:08.4918828Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:08.4932499Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4960705Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:08.4979664Z http.https://github.com/.extraheader 2025-12-04T08:54:08.4998668Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:08.5011872Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5029933Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:08.5044322Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5062869Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:08.5075574Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5103307Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:08.5116356Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5149015Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:08.5169430Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5212196Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:08.5225039Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5253705Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:08.5272779Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5305546Z Entering 'third_party/kleidiai' 2025-12-04T08:54:08.5333598Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5364765Z Entering 'third_party/mimalloc' 2025-12-04T08:54:08.5388098Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5406744Z Entering 'third_party/nlohmann' 2025-12-04T08:54:08.5430046Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5450044Z 
Entering 'third_party/onnx' 2025-12-04T08:54:08.5470275Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5493089Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:08.5514102Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5545860Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:08.5559495Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5580924Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:08.5607562Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5638107Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:08.5652021Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5684733Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:08.5712899Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5742412Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:08.5755907Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5774102Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:08.5792925Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5810216Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:08.5828098Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5846389Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:08.5859422Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5888099Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:08.5902523Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5920419Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:08.5939648Z http.https://github.com/.extraheader 2025-12-04T08:54:08.5958084Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:08.5978143Z 
http.https://github.com/.extraheader 2025-12-04T08:54:08.6004936Z Entering 'third_party/pocketfft' 2025-12-04T08:54:08.6019263Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6036416Z Entering 'third_party/protobuf' 2025-12-04T08:54:08.6065022Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6097640Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:08.6112289Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6143556Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:08.6156679Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6187057Z Entering 'third_party/psimd' 2025-12-04T08:54:08.6201382Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6230169Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:08.6250796Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6279756Z Entering 'third_party/pybind11' 2025-12-04T08:54:08.6309264Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6328912Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:08.6345571Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6373599Z Entering 'third_party/sleef' 2025-12-04T08:54:08.6400602Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6420270Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:08.6434688Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6463292Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:08.6476580Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6493631Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:08.6506280Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6524030Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:08.6550292Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6569052Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:08.6592993Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6620768Z Entering 
'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:08.6635408Z http.https://github.com/.extraheader 2025-12-04T08:54:08.6690442Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.6734138Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T08:54:08.6930207Z Entering 'android/libs/fbjni' 2025-12-04T08:54:08.6946941Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T08:54:08.6967685Z Entering 'third_party/FP16' 2025-12-04T08:54:08.6979142Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T08:54:08.6989199Z Entering 'third_party/FXdiv' 2025-12-04T08:54:08.7000694Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T08:54:08.7021739Z Entering 'third_party/NNPACK' 2025-12-04T08:54:08.7034332Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T08:54:08.7043259Z Entering 'third_party/NVTX' 2025-12-04T08:54:08.7054208Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T08:54:08.7063986Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:08.7075284Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T08:54:08.7084658Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:08.7096050Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T08:54:08.7110273Z Entering 'third_party/aiter' 2025-12-04T08:54:08.7121124Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T08:54:08.7131678Z Entering 
'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:08.7142076Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T08:54:08.7155778Z Entering 'third_party/benchmark' 2025-12-04T08:54:08.7168419Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:08.7177691Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:08.7189065Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T08:54:08.7201738Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:08.7212141Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T08:54:08.7221087Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:08.7231698Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T08:54:08.7240880Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:08.7251184Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T08:54:08.7261151Z Entering 'third_party/cutlass' 2025-12-04T08:54:08.7272715Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T08:54:08.7285609Z Entering 'third_party/fbgemm' 2025-12-04T08:54:08.7295868Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T08:54:08.7306973Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:08.7326326Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T08:54:08.7346477Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:08.7357058Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T08:54:08.7368971Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:08.7379904Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T08:54:08.7399663Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:08.7410531Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T08:54:08.7438604Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:08.7455793Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T08:54:08.7464058Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:08.7483323Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T08:54:08.7492713Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:08.7502601Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T08:54:08.7514487Z Entering 'third_party/flash-attention' 2025-12-04T08:54:08.7532115Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T08:54:08.7542475Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:08.7553131Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T08:54:08.7565243Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:08.7578817Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 
2025-12-04T08:54:08.7591974Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:08.7602420Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T08:54:08.7614240Z Entering 'third_party/fmt' 2025-12-04T08:54:08.7625263Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:54:08.7634566Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:08.7645114Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T08:54:08.7665154Z Entering 'third_party/gloo' 2025-12-04T08:54:08.7690189Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T08:54:08.7710370Z Entering 'third_party/googletest' 2025-12-04T08:54:08.7730219Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:08.7740099Z Entering 'third_party/ideep' 2025-12-04T08:54:08.7749529Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T08:54:08.7759837Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:08.7767735Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T08:54:08.7778149Z Entering 'third_party/ittapi' 2025-12-04T08:54:08.7803870Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T08:54:08.7804639Z Entering 'third_party/kineto' 2025-12-04T08:54:08.7805287Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T08:54:08.7808427Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:08.7816901Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T08:54:08.7823573Z 
Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:08.7833924Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T08:54:08.7841917Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:08.7861791Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T08:54:08.7870946Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:08.7898560Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:54:08.7906314Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:08.7926679Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T08:54:08.7934779Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:08.7950104Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T08:54:08.7959125Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:08.7968643Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T08:54:08.7975743Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:08.7990040Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:08.7997634Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:08.8005543Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T08:54:08.8012518Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:08.8026325Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T08:54:08.8034559Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:08.8043007Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:54:08.8049346Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:08.8059432Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:54:08.8067193Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:08.8086379Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:54:08.8095842Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:08.8107536Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config 
remote.origin.url 2025-12-04T08:54:08.8114369Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:08.8124246Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T08:54:08.8131710Z Entering 'third_party/kleidiai' 2025-12-04T08:54:08.8139829Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T08:54:08.8147232Z Entering 'third_party/mimalloc' 2025-12-04T08:54:08.8170646Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T08:54:08.8178429Z Entering 'third_party/nlohmann' 2025-12-04T08:54:08.8190172Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T08:54:08.8199702Z Entering 'third_party/onnx' 2025-12-04T08:54:08.8209538Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T08:54:08.8223883Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:08.8232260Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:08.8240556Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:08.8248681Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T08:54:08.8255449Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:08.8266276Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:08.8274249Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:08.8283591Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 
2025-12-04T08:54:08.8290869Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:08.8300774Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T08:54:08.8307722Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:08.8318829Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T08:54:08.8341865Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:08.8357602Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T08:54:08.8369542Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:08.8396211Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T08:54:08.8406243Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:08.8431084Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:54:08.8448269Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:08.8464007Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:54:08.8474846Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:08.8490074Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:54:08.8506708Z Entering 
'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:08.8529849Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T08:54:08.8549697Z Entering 'third_party/pocketfft' 2025-12-04T08:54:08.8561778Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T08:54:08.8571309Z Entering 'third_party/protobuf' 2025-12-04T08:54:08.8582374Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T08:54:08.8595056Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:08.8611409Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:08.8619904Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:08.8629908Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:08.8640520Z Entering 'third_party/psimd' 2025-12-04T08:54:08.8651187Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T08:54:08.8660796Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:08.8671427Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T08:54:08.8680628Z Entering 'third_party/pybind11' 2025-12-04T08:54:08.8691279Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:08.8700814Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:08.8715662Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T08:54:08.8736507Z Entering 'third_party/sleef' 2025-12-04T08:54:08.8757361Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T08:54:08.8784496Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:08.8803382Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T08:54:08.8818574Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:08.8830327Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:08.8840569Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:08.8862199Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T08:54:08.8871743Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:08.8883504Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T08:54:08.8893023Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:08.8910127Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:08.8919467Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:08.8929726Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T08:54:08.8967915Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.8999417Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9045409Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9072523Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9099964Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9125594Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9160534Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9184747Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9207493Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9231091Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9256345Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9291396Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9327356Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9349848Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9373462Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9399187Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9423825Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9461147Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9494762Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9529228Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9555886Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9578081Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9601815Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9637445Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9663579Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9696235Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9720978Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9756689Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9791955Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9816095Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9838416Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9872009Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9906384Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9932959Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9966664Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:08.9991763Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0030749Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0070734Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0096890Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0122149Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0148169Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T08:54:09.0184691Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0220730Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0245077Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0281154Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0317149Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0351728Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0389222Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0423969Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp 
^includeIf\.gitdir: 2025-12-04T08:54:09.0462003Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0495833Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0528427Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0563344Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0587464Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0621568Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0657692Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0683502Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0720154Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0757586Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0781960Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0807872Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0843960Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0872186Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0899551Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0925468Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0960382Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.0993267Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T08:54:09.1018725Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1053620Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1089655Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1115270Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1154963Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1181898Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1214494Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1243173Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1279411Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1312491Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T08:54:09.1337646Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1364781Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1399960Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1426771Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T08:54:09.1466921Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T08:54:09.1517612Z ##[endgroup] 2025-12-04T08:54:09.1518191Z ##[group]Fetching the repository 2025-12-04T08:54:09.1528669Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2025-12-04T08:54:10.6501526Z [command]/usr/bin/git rev-parse --verify --quiet ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T08:54:10.6608876Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:10.6612641Z ##[endgroup] 2025-12-04T08:54:10.6613245Z ##[group]Determining the checkout info 2025-12-04T08:54:10.6613841Z ##[endgroup] 2025-12-04T08:54:10.6617437Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T08:54:10.6710759Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T08:54:10.6737582Z ##[group]Checking out the ref 2025-12-04T08:54:10.6738659Z [command]/usr/bin/git checkout --progress --force 
ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:10.7012412Z HEAD is now at ffd9b0fb4355 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T08:54:10.7018536Z ##[endgroup] 2025-12-04T08:54:10.7019092Z ##[group]Setting up auth for fetching submodules 2025-12-04T08:54:10.7026696Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T08:54:10.7063679Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-12-04T08:54:10.7099921Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T08:54:10.7133523Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T08:54:10.7160689Z ##[endgroup] 2025-12-04T08:54:10.7161219Z ##[group]Fetching submodules 2025-12-04T08:54:10.7166452Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T08:54:10.7393368Z Synchronizing submodule url for 'android/libs/fbjni' 2025-12-04T08:54:10.7405046Z Synchronizing submodule url for 'third_party/FP16' 2025-12-04T08:54:10.7416510Z Synchronizing submodule url for 'third_party/FXdiv' 2025-12-04T08:54:10.7428344Z Synchronizing submodule url for 'third_party/NNPACK' 2025-12-04T08:54:10.7440046Z Synchronizing submodule url for 'third_party/NVTX' 2025-12-04T08:54:10.7450662Z Synchronizing submodule url for 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:10.7461585Z Synchronizing submodule url for 'third_party/XNNPACK' 2025-12-04T08:54:10.7478111Z Synchronizing submodule url for 'third_party/aiter' 2025-12-04T08:54:10.7492542Z Synchronizing submodule url for 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:10.7508795Z Synchronizing submodule url for 'third_party/benchmark' 2025-12-04T08:54:10.7531677Z Synchronizing submodule url for 'third_party/composable_kernel' 2025-12-04T08:54:10.7547227Z Synchronizing submodule url for 'third_party/cpp-httplib' 
2025-12-04T08:54:10.7558868Z Synchronizing submodule url for 'third_party/cpuinfo' 2025-12-04T08:54:10.7569998Z Synchronizing submodule url for 'third_party/cudnn_frontend' 2025-12-04T08:54:10.7580777Z Synchronizing submodule url for 'third_party/cutlass' 2025-12-04T08:54:10.7595165Z Synchronizing submodule url for 'third_party/fbgemm' 2025-12-04T08:54:10.7608849Z Synchronizing submodule url for 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:10.7629952Z Synchronizing submodule url for 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:10.7657328Z Synchronizing submodule url for 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:10.7668656Z Synchronizing submodule url for 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:10.7699245Z Synchronizing submodule url for 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:10.7723641Z Synchronizing submodule url for 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:10.7744136Z Synchronizing submodule url for 'third_party/fbgemm/external/json' 2025-12-04T08:54:10.7757631Z Synchronizing submodule url for 'third_party/flash-attention' 2025-12-04T08:54:10.7768891Z Synchronizing submodule url for 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:10.7797868Z Synchronizing submodule url for 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:10.7814771Z Synchronizing submodule url for 'third_party/flatbuffers' 2025-12-04T08:54:10.7827236Z Synchronizing submodule url for 'third_party/fmt' 2025-12-04T08:54:10.7856075Z Synchronizing submodule url for 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:10.7879366Z Synchronizing submodule url for 'third_party/gloo' 2025-12-04T08:54:10.7890804Z Synchronizing submodule url for 'third_party/googletest' 2025-12-04T08:54:10.7905817Z Synchronizing submodule url for 'third_party/ideep' 2025-12-04T08:54:10.7917660Z Synchronizing submodule url for 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:10.7954370Z Synchronizing 
submodule url for 'third_party/ittapi' 2025-12-04T08:54:10.7965791Z Synchronizing submodule url for 'third_party/kineto' 2025-12-04T08:54:10.7981299Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:10.7996739Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:10.8009391Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:10.8023338Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:10.8036333Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:10.8049019Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:10.8062191Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:10.8088231Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:10.8103172Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:10.8116026Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:10.8145582Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:10.8179284Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:10.8206904Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:10.8242393Z Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:10.8258042Z 
Synchronizing submodule url for 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:10.8276274Z Synchronizing submodule url for 'third_party/kleidiai' 2025-12-04T08:54:10.8302099Z Synchronizing submodule url for 'third_party/mimalloc' 2025-12-04T08:54:10.8323516Z Synchronizing submodule url for 'third_party/nlohmann' 2025-12-04T08:54:10.8338255Z Synchronizing submodule url for 'third_party/onnx' 2025-12-04T08:54:10.8359979Z Synchronizing submodule url for 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:10.8374175Z Synchronizing submodule url for 'third_party/opentelemetry-cpp' 2025-12-04T08:54:10.8408421Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:10.8423964Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:10.8434971Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:10.8449688Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:10.8461000Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:10.8471480Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:10.8483041Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:10.8494808Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:10.8508744Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:10.8523985Z Synchronizing submodule url for 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:10.8543757Z Synchronizing submodule url for 'third_party/pocketfft' 2025-12-04T08:54:10.8555523Z Synchronizing submodule url for 'third_party/protobuf' 
2025-12-04T08:54:10.8567435Z Synchronizing submodule url for 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:10.8579504Z Synchronizing submodule url for 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:10.8594614Z Synchronizing submodule url for 'third_party/psimd' 2025-12-04T08:54:10.8610333Z Synchronizing submodule url for 'third_party/pthreadpool' 2025-12-04T08:54:10.8629372Z Synchronizing submodule url for 'third_party/pybind11' 2025-12-04T08:54:10.8654244Z Synchronizing submodule url for 'third_party/python-peachpy' 2025-12-04T08:54:10.8666883Z Synchronizing submodule url for 'third_party/sleef' 2025-12-04T08:54:10.8690646Z Synchronizing submodule url for 'third_party/tensorpipe' 2025-12-04T08:54:10.8712057Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:10.8723984Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:10.8735224Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:10.8756768Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:10.8771325Z Synchronizing submodule url for 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:10.8805297Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2025-12-04T08:54:10.9112119Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T08:54:10.9168622Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T08:54:10.9229199Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T08:54:10.9278256Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T08:54:10.9337214Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 
2025-12-04T08:54:10.9412790Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T08:54:10.9571521Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T08:54:10.9750208Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T08:54:10.9960616Z Submodule path 'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T08:54:11.0047258Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T08:54:11.0274928Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T08:54:11.0378870Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T08:54:11.0457249Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T08:54:11.0555507Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T08:54:11.0694628Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 2025-12-04T08:54:11.0843124Z Submodule path 'third_party/fbgemm': checked out 'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T08:54:11.0913898Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T08:54:11.1105117Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T08:54:11.1198323Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T08:54:11.1330675Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T08:54:11.1421224Z 
Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:11.1485686Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T08:54:11.1596561Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T08:54:11.1729831Z Submodule path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T08:54:11.1954541Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T08:54:11.2081469Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T08:54:11.2206368Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T08:54:11.2302377Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T08:54:11.2385421Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T08:54:11.2454874Z Submodule path 'third_party/gloo': checked out '54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T08:54:11.2510215Z Submodule path 'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:11.2574749Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T08:54:11.2767347Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T08:54:11.2860670Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T08:54:11.2950630Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T08:54:11.3067017Z Submodule path 
'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T08:54:11.3179369Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T08:54:11.3267181Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2025-12-04T08:54:11.3342707Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T08:54:11.3432876Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T08:54:11.3507230Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T08:54:11.3584206Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T08:54:11.3637004Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:11.3729031Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T08:54:11.3797400Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T08:54:11.3859933Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T08:54:11.3974112Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 
'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T08:54:11.4038296Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T08:54:11.4099906Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T08:54:11.4177148Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T08:54:11.4296842Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T08:54:11.4388308Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T08:54:11.4514685Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T08:54:11.4684174Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T08:54:11.4770410Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T08:54:11.4896215Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T08:54:11.4958069Z Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T08:54:11.5029369Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T08:54:11.5089753Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T08:54:11.5183937Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T08:54:11.5248108Z Submodule path 
'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T08:54:11.5301253Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T08:54:11.5374739Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T08:54:11.5498151Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T08:54:11.5574780Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T08:54:11.5710048Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T08:54:11.5785310Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T08:54:11.5960609Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T08:54:11.6035832Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T08:54:11.6098885Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T08:54:11.6149098Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T08:54:11.6198426Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T08:54:11.6266867Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T08:54:11.6317308Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 
2025-12-04T08:54:11.6372229Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T08:54:11.6432353Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T08:54:11.6507491Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T08:54:11.6565706Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T08:54:11.6681350Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T08:54:11.6764435Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T08:54:11.6817820Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T08:54:11.6845529Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T08:54:11.6950240Z Entering 'android/libs/fbjni' 2025-12-04T08:54:11.6965133Z Entering 'third_party/FP16' 2025-12-04T08:54:11.6982782Z Entering 'third_party/FXdiv' 2025-12-04T08:54:11.7002858Z Entering 'third_party/NNPACK' 2025-12-04T08:54:11.7020314Z Entering 'third_party/NVTX' 2025-12-04T08:54:11.7034968Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:11.7049372Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:11.7072067Z Entering 'third_party/aiter' 2025-12-04T08:54:11.7090155Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:11.7107048Z Entering 'third_party/benchmark' 2025-12-04T08:54:11.7123372Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:11.7141235Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:11.7155088Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:11.7169227Z Entering 'third_party/cudnn_frontend' 
2025-12-04T08:54:11.7183560Z Entering 'third_party/cutlass' 2025-12-04T08:54:11.7204010Z Entering 'third_party/fbgemm' 2025-12-04T08:54:11.7218928Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:11.7232371Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:11.7249243Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:11.7265294Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:11.7283171Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:11.7296398Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:11.7311373Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:11.7329296Z Entering 'third_party/flash-attention' 2025-12-04T08:54:11.7343793Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:11.7359806Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:11.7379595Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:11.7394463Z Entering 'third_party/fmt' 2025-12-04T08:54:11.7407980Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:11.7421723Z Entering 'third_party/gloo' 2025-12-04T08:54:11.7435180Z Entering 'third_party/googletest' 2025-12-04T08:54:11.7448636Z Entering 'third_party/ideep' 2025-12-04T08:54:11.7465342Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:11.7481355Z Entering 'third_party/ittapi' 2025-12-04T08:54:11.7495244Z Entering 'third_party/kineto' 2025-12-04T08:54:11.7512321Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:11.7529038Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:11.7543721Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:11.7560689Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:11.7575805Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:11.7590052Z 
Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:11.7604268Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:11.7618866Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:11.7632663Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:11.7650090Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:11.7665841Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:11.7679218Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:11.7693660Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:11.7708504Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:11.7721894Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:11.7738574Z Entering 'third_party/kleidiai' 2025-12-04T08:54:11.7755841Z Entering 'third_party/mimalloc' 2025-12-04T08:54:11.7770050Z Entering 'third_party/nlohmann' 2025-12-04T08:54:11.7783993Z Entering 'third_party/onnx' 2025-12-04T08:54:11.7804158Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:11.7819987Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:11.7832767Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:11.7849139Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:11.7862242Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:11.7879017Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:11.7894469Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:11.7909354Z Entering 
'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:11.7924097Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:11.7938056Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:11.7954763Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:11.7972138Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:11.7994904Z Entering 'third_party/pocketfft' 2025-12-04T08:54:11.8008808Z Entering 'third_party/protobuf' 2025-12-04T08:54:11.8023864Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:11.8058137Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:11.8072166Z Entering 'third_party/psimd' 2025-12-04T08:54:11.8085752Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:11.8107891Z Entering 'third_party/pybind11' 2025-12-04T08:54:11.8121988Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:11.8136175Z Entering 'third_party/sleef' 2025-12-04T08:54:11.8149739Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:11.8163113Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:11.8176781Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:11.8190088Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:11.8206644Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:11.8220981Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:11.8247056Z ##[endgroup] 2025-12-04T08:54:11.8247602Z ##[group]Persisting credentials for submodules 2025-12-04T08:54:11.8254618Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T08:54:11.8382874Z Entering 'android/libs/fbjni' 
2025-12-04T08:54:11.8394397Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8394793Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8408812Z Entering 'third_party/FP16' 2025-12-04T08:54:11.8417546Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8417932Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8431580Z Entering 'third_party/FXdiv' 2025-12-04T08:54:11.8440623Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8441004Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8452956Z Entering 'third_party/NNPACK' 2025-12-04T08:54:11.8464459Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8464836Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8493455Z Entering 'third_party/NVTX' 2025-12-04T08:54:11.8509887Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8510271Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8529635Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:11.8543490Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8543869Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8573176Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:11.8590244Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8590630Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8613278Z Entering 'third_party/aiter' 2025-12-04T08:54:11.8629126Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8629512Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8647766Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:11.8659430Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8659813Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8683201Z Entering 'third_party/benchmark' 2025-12-04T08:54:11.8694276Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8694662Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8709097Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:11.8717929Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8718309Z 
url.https://github.com/.insteadof 2025-12-04T08:54:11.8734327Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:11.8755140Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8755521Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8771967Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:11.8782678Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8783059Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8795133Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:11.8805832Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8806278Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8826617Z Entering 'third_party/cutlass' 2025-12-04T08:54:11.8852116Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8852496Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8887680Z Entering 'third_party/fbgemm' 2025-12-04T08:54:11.8901226Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8901606Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8917893Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:11.8934524Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8934904Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8952062Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:11.8962759Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8963142Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8982690Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:11.8993567Z url.https://github.com/.insteadof 2025-12-04T08:54:11.8993951Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9006627Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:11.9017419Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9017805Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9036265Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:11.9050134Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9050518Z url.https://github.com/.insteadof 
2025-12-04T08:54:11.9064174Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:11.9074298Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9074681Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9089440Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:11.9103627Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9104005Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9124679Z Entering 'third_party/flash-attention' 2025-12-04T08:54:11.9136825Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9137208Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9154566Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:11.9176446Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9176827Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9200191Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:11.9213196Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9213575Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9255206Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:11.9287239Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9287629Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9309336Z Entering 'third_party/fmt' 2025-12-04T08:54:11.9332595Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9332972Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9355189Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:11.9408010Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9408385Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9457330Z Entering 'third_party/gloo' 2025-12-04T08:54:11.9497257Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9497640Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9526736Z Entering 'third_party/googletest' 2025-12-04T08:54:11.9553953Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9554342Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9574681Z 
Entering 'third_party/ideep' 2025-12-04T08:54:11.9608146Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9608526Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9648402Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:11.9678308Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9678941Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9701816Z Entering 'third_party/ittapi' 2025-12-04T08:54:11.9717712Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9718098Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9736362Z Entering 'third_party/kineto' 2025-12-04T08:54:11.9766146Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9766525Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9793645Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:11.9813771Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9814174Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9839195Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:11.9860768Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9861158Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9879364Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:11.9902966Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9903368Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9921018Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:11.9946374Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9946760Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9978271Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:11.9998122Z url.https://github.com/.insteadof 2025-12-04T08:54:11.9998508Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0019939Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 
2025-12-04T08:54:12.0038016Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0038403Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0062284Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:12.0086943Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0087329Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0130715Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:12.0163390Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0163867Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0188188Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:12.0215815Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0216317Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0239485Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:12.0265642Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0266582Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0288701Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:12.0316346Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0316798Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0337964Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:12.0365405Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0365875Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0388556Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:12.0404015Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0404913Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0432900Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:12.0448339Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0448545Z 
url.https://github.com/.insteadof 2025-12-04T08:54:12.0477929Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:12.0495243Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0495728Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0518332Z Entering 'third_party/kleidiai' 2025-12-04T08:54:12.0543191Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0543657Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0568585Z Entering 'third_party/mimalloc' 2025-12-04T08:54:12.0584073Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0584534Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0606361Z Entering 'third_party/nlohmann' 2025-12-04T08:54:12.0623622Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0624088Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0644717Z Entering 'third_party/onnx' 2025-12-04T08:54:12.0660221Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0660692Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0688820Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:12.0714282Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0714463Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0748925Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:12.0775530Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0776294Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0795122Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:12.0809926Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0810383Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0846411Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:12.0859435Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0859665Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0875199Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:12.0926504Z url.https://github.com/.insteadof 
2025-12-04T08:54:12.0926692Z url.https://github.com/.insteadof 2025-12-04T08:54:12.0967514Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:12.0977027Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1006219Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1006419Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:12.1027971Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1028104Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1076524Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:12.1096817Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1096989Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1122028Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:12.1140369Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1140525Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1158267Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:12.1169273Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1169790Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1183222Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:12.1226411Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1226553Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1233120Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:12.1271912Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1275328Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1305960Z Entering 'third_party/pocketfft' 2025-12-04T08:54:12.1328033Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1328207Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1356081Z Entering 'third_party/protobuf' 2025-12-04T08:54:12.1378778Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1378942Z 
url.https://github.com/.insteadof 2025-12-04T08:54:12.1410296Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:12.1425366Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1425527Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1510163Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:12.1523676Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1523939Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1587195Z Entering 'third_party/psimd' 2025-12-04T08:54:12.1622281Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1622740Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1656357Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:12.1667519Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1667682Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1697797Z Entering 'third_party/pybind11' 2025-12-04T08:54:12.1705428Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1705665Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1746990Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:12.1757618Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1757943Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1789930Z Entering 'third_party/sleef' 2025-12-04T08:54:12.1808756Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1808940Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1886334Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:12.1904707Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1904876Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1931884Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:12.1947880Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1948313Z url.https://github.com/.insteadof 2025-12-04T08:54:12.1999636Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:12.2026286Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2026498Z url.https://github.com/.insteadof 
2025-12-04T08:54:12.2048319Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:12.2064364Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2064548Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2106159Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:12.2106310Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2106641Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2118752Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:12.2156066Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2156285Z url.https://github.com/.insteadof 2025-12-04T08:54:12.2222777Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T08:54:12.2605668Z Entering 'android/libs/fbjni' 2025-12-04T08:54:12.2640279Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T08:54:12.2667569Z Entering 'third_party/FP16' 2025-12-04T08:54:12.2716339Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T08:54:12.2739435Z Entering 'third_party/FXdiv' 2025-12-04T08:54:12.2776539Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T08:54:12.2788082Z Entering 'third_party/NNPACK' 2025-12-04T08:54:12.2850909Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T08:54:12.2885150Z Entering 'third_party/NVTX' 2025-12-04T08:54:12.2903432Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T08:54:12.2926265Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:12.2973811Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T08:54:12.3006692Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:12.3010942Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T08:54:12.3033598Z Entering 'third_party/aiter' 2025-12-04T08:54:12.3120514Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T08:54:12.3135414Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:12.3187970Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T08:54:12.3256822Z Entering 'third_party/benchmark' 2025-12-04T08:54:12.3270472Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:12.3285091Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:12.3318288Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T08:54:12.3364661Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:12.3409762Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T08:54:12.3457715Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:12.3504560Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T08:54:12.3516182Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:12.3564902Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T08:54:12.3590671Z Entering 'third_party/cutlass' 2025-12-04T08:54:12.3648849Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T08:54:12.3649142Z Entering 'third_party/fbgemm' 2025-12-04T08:54:12.3660359Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T08:54:12.3673044Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:12.3700736Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T08:54:12.3736474Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:12.3786285Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T08:54:12.3803389Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:12.3841225Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T08:54:12.3852031Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:12.3944301Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T08:54:12.3994566Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:12.4019382Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T08:54:12.4033452Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:12.4053158Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T08:54:12.4061358Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:12.4146272Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T08:54:12.4146544Z Entering 'third_party/flash-attention' 2025-12-04T08:54:12.4169494Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T08:54:12.4179027Z Entering 
'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:12.4222479Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T08:54:12.4236513Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:12.4277321Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T08:54:12.4301620Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:12.4320243Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T08:54:12.4336076Z Entering 'third_party/fmt' 2025-12-04T08:54:12.4422887Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:54:12.4423129Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:12.4450445Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T08:54:12.4501023Z Entering 'third_party/gloo' 2025-12-04T08:54:12.4501225Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T08:54:12.4512811Z Entering 'third_party/googletest' 2025-12-04T08:54:12.4547307Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:12.4554767Z Entering 'third_party/ideep' 2025-12-04T08:54:12.4580802Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T08:54:12.4597845Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:12.4615128Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T08:54:12.4676220Z Entering 'third_party/ittapi' 2025-12-04T08:54:12.4683608Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 
2025-12-04T08:54:12.4756533Z Entering 'third_party/kineto' 2025-12-04T08:54:12.4756937Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T08:54:12.4766358Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:12.4811429Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T08:54:12.4876216Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:12.4888553Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T08:54:12.4901638Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:12.4933857Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T08:54:12.4945118Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:12.4966123Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T08:54:12.4978352Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:12.5009624Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T08:54:12.5024680Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:12.5067428Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T08:54:12.5084088Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:12.5110153Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T08:54:12.5123837Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:12.5139941Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:12.5151318Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:12.5206563Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T08:54:12.5256223Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:12.5259869Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T08:54:12.5271851Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:12.5316540Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:54:12.5316951Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:12.5361712Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:54:12.5362215Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:12.5384338Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:54:12.5403389Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:12.5425528Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T08:54:12.5436138Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:12.5496281Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T08:54:12.5499981Z Entering 'third_party/kleidiai' 2025-12-04T08:54:12.5517392Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T08:54:12.5552125Z Entering 'third_party/mimalloc' 2025-12-04T08:54:12.5606693Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T08:54:12.5619480Z Entering 'third_party/nlohmann' 2025-12-04T08:54:12.5649825Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T08:54:12.5660213Z Entering 'third_party/onnx' 2025-12-04T08:54:12.5683019Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T08:54:12.5703747Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:12.5727668Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:12.5766376Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:12.5786687Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T08:54:12.5828972Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:12.5855036Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:12.5896309Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:12.5926307Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:12.5932108Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:12.5957261Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T08:54:12.5967214Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:12.6010593Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T08:54:12.6025612Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:12.6047752Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T08:54:12.6086374Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:12.6088043Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T08:54:12.6093082Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:12.6135843Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T08:54:12.6167124Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:12.6178149Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T08:54:12.6198131Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:12.6230768Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T08:54:12.6257071Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:12.6309364Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T08:54:12.6340986Z Entering 'third_party/pocketfft' 2025-12-04T08:54:12.6356857Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T08:54:12.6376264Z Entering 'third_party/protobuf' 2025-12-04T08:54:12.6399156Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T08:54:12.6441916Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:12.6470755Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T08:54:12.6536295Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:12.6586416Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:12.6586685Z Entering 'third_party/psimd' 2025-12-04T08:54:12.6601088Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T08:54:12.6637154Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:12.6670085Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 
2025-12-04T08:54:12.6682652Z Entering 'third_party/pybind11' 2025-12-04T08:54:12.6710092Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:12.6726577Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:12.6747553Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T08:54:12.6760741Z Entering 'third_party/sleef' 2025-12-04T08:54:12.6782872Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T08:54:12.6791791Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:12.6845849Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T08:54:12.6887223Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:12.6911935Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T08:54:12.6924469Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:12.6941063Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T08:54:12.6951996Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:12.6969953Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T08:54:12.6979084Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:12.7048279Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T08:54:12.7091501Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:12.7134416Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config 
remote.origin.url 2025-12-04T08:54:12.7464148Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T08:54:12.7596187Z Entering 'android/libs/fbjni' 2025-12-04T08:54:12.7613506Z Entering 'third_party/FP16' 2025-12-04T08:54:12.7641290Z Entering 'third_party/FXdiv' 2025-12-04T08:54:12.7657556Z Entering 'third_party/NNPACK' 2025-12-04T08:54:12.7672790Z Entering 'third_party/NVTX' 2025-12-04T08:54:12.7688865Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:12.7704006Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:12.7728866Z Entering 'third_party/aiter' 2025-12-04T08:54:12.7744163Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:12.7796648Z Entering 'third_party/benchmark' 2025-12-04T08:54:12.7814633Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:12.7862411Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:12.7937761Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:12.7971547Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:12.7993217Z Entering 'third_party/cutlass' 2025-12-04T08:54:12.8012619Z Entering 'third_party/fbgemm' 2025-12-04T08:54:12.8027341Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:12.8043093Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T08:54:12.8060954Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:12.8076391Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:12.8094575Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:12.8109108Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:12.8123348Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:12.8139600Z Entering 'third_party/flash-attention' 2025-12-04T08:54:12.8154532Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:12.8171924Z Entering 'third_party/flash-attention/csrc/cutlass' 
2025-12-04T08:54:12.8198797Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:12.8219708Z Entering 'third_party/fmt' 2025-12-04T08:54:12.8236421Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:12.8254455Z Entering 'third_party/gloo' 2025-12-04T08:54:12.8271022Z Entering 'third_party/googletest' 2025-12-04T08:54:12.8293957Z Entering 'third_party/ideep' 2025-12-04T08:54:12.8307662Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:12.8325439Z Entering 'third_party/ittapi' 2025-12-04T08:54:12.8341632Z Entering 'third_party/kineto' 2025-12-04T08:54:12.8358241Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:12.8372952Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:12.8389546Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:12.8404346Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:12.8419079Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:12.8433875Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:12.8450171Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:12.8468938Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:12.8484239Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:12.8500886Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:12.8515475Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:12.8530516Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:12.8547074Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 
2025-12-04T08:54:12.8563476Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:12.8579165Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:12.8594835Z Entering 'third_party/kleidiai' 2025-12-04T08:54:12.8609131Z Entering 'third_party/mimalloc' 2025-12-04T08:54:12.8624911Z Entering 'third_party/nlohmann' 2025-12-04T08:54:12.8640889Z Entering 'third_party/onnx' 2025-12-04T08:54:12.8662837Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:12.8679932Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:12.8695875Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:12.8720362Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:12.8735547Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:12.8749872Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:12.8765202Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:12.8780289Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:12.8795810Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:12.8814479Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:12.8830044Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:12.8846598Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:12.8869620Z Entering 'third_party/pocketfft' 2025-12-04T08:54:12.8884797Z Entering 'third_party/protobuf' 2025-12-04T08:54:12.8910090Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:12.8925718Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:12.8944601Z Entering 'third_party/psimd' 2025-12-04T08:54:12.8958347Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:12.8973149Z Entering 
'third_party/pybind11' 2025-12-04T08:54:12.8986939Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:12.9000521Z Entering 'third_party/sleef' 2025-12-04T08:54:12.9013894Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:12.9027455Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:12.9043107Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:12.9057993Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:12.9073918Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:12.9088084Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:12.9114326Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-12-04T08:54:12.9227646Z Entering 'android/libs/fbjni' 2025-12-04T08:54:12.9241542Z Entering 'third_party/FP16' 2025-12-04T08:54:12.9254966Z Entering 'third_party/FXdiv' 2025-12-04T08:54:12.9268889Z Entering 'third_party/NNPACK' 2025-12-04T08:54:12.9283856Z Entering 'third_party/NVTX' 2025-12-04T08:54:12.9303295Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T08:54:12.9316910Z Entering 'third_party/XNNPACK' 2025-12-04T08:54:12.9336558Z Entering 'third_party/aiter' 2025-12-04T08:54:12.9353922Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T08:54:12.9373416Z Entering 'third_party/benchmark' 2025-12-04T08:54:12.9389632Z Entering 'third_party/composable_kernel' 2025-12-04T08:54:12.9411228Z Entering 'third_party/cpp-httplib' 2025-12-04T08:54:12.9427384Z Entering 'third_party/cpuinfo' 2025-12-04T08:54:12.9444753Z Entering 'third_party/cudnn_frontend' 2025-12-04T08:54:12.9459646Z Entering 'third_party/cutlass' 2025-12-04T08:54:12.9480739Z Entering 'third_party/fbgemm' 2025-12-04T08:54:12.9497656Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T08:54:12.9513889Z Entering 'third_party/fbgemm/external/composable_kernel' 
2025-12-04T08:54:12.9532624Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T08:54:12.9548340Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T08:54:12.9567499Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T08:54:12.9581530Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T08:54:12.9594492Z Entering 'third_party/fbgemm/external/json' 2025-12-04T08:54:12.9608940Z Entering 'third_party/flash-attention' 2025-12-04T08:54:12.9626266Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T08:54:12.9642235Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T08:54:12.9661732Z Entering 'third_party/flatbuffers' 2025-12-04T08:54:12.9676695Z Entering 'third_party/fmt' 2025-12-04T08:54:12.9692442Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T08:54:12.9706378Z Entering 'third_party/gloo' 2025-12-04T08:54:12.9720020Z Entering 'third_party/googletest' 2025-12-04T08:54:12.9738527Z Entering 'third_party/ideep' 2025-12-04T08:54:12.9753822Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T08:54:12.9772427Z Entering 'third_party/ittapi' 2025-12-04T08:54:12.9786398Z Entering 'third_party/kineto' 2025-12-04T08:54:12.9800026Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T08:54:12.9815678Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T08:54:12.9832234Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T08:54:12.9845729Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T08:54:12.9861581Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T08:54:12.9885903Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T08:54:12.9905705Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T08:54:12.9919601Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T08:54:12.9934580Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T08:54:12.9950259Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T08:54:12.9969346Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T08:54:12.9984478Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:13.0009953Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:13.0027353Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T08:54:13.0040771Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T08:54:13.0126398Z Entering 'third_party/kleidiai' 2025-12-04T08:54:13.0157355Z Entering 'third_party/mimalloc' 2025-12-04T08:54:13.0186691Z Entering 'third_party/nlohmann' 2025-12-04T08:54:13.0206676Z Entering 'third_party/onnx' 2025-12-04T08:54:13.0237375Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T08:54:13.0263109Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T08:54:13.0340849Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T08:54:13.0381697Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T08:54:13.0402950Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T08:54:13.0497474Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T08:54:13.0526737Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T08:54:13.0564228Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T08:54:13.0605393Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T08:54:13.0634148Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T08:54:13.0665531Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T08:54:13.0747835Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T08:54:13.0786469Z Entering 'third_party/pocketfft' 2025-12-04T08:54:13.0836338Z Entering 'third_party/protobuf' 2025-12-04T08:54:13.0917200Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T08:54:13.0942756Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T08:54:13.0971317Z Entering 'third_party/psimd' 2025-12-04T08:54:13.0994496Z Entering 'third_party/pthreadpool' 2025-12-04T08:54:13.1037624Z Entering 'third_party/pybind11' 2025-12-04T08:54:13.1056885Z Entering 'third_party/python-peachpy' 2025-12-04T08:54:13.1091316Z Entering 'third_party/sleef' 2025-12-04T08:54:13.1121346Z Entering 'third_party/tensorpipe' 2025-12-04T08:54:13.1152546Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T08:54:13.1179888Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T08:54:13.1208013Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T08:54:13.1286208Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T08:54:13.1286409Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T08:54:13.1353376Z ##[endgroup] 2025-12-04T08:54:13.1615581Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T08:54:13.1827969Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:13.2098725Z Prepare all required actions 2025-12-04T08:54:13.2099038Z Getting action download info 2025-12-04T08:54:13.4693439Z Download action repository 'aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076' (SHA:062b18b96a7aff071d4dc91bc00c4c1a7945b076) 2025-12-04T08:54:14.4305664Z ##[group]Run ./.github/actions/setup-rocm 2025-12-04T08:54:14.4305791Z env: 2025-12-04T08:54:14.4305875Z GIT_DEFAULT_BRANCH: main 
2025-12-04T08:54:14.4306066Z ##[endgroup] 2025-12-04T08:54:14.4318547Z ##[group]Run dpkg -l | grep -E " rocm" 2025-12-04T08:54:14.4318702Z dpkg -l | grep -E " rocm" 2025-12-04T08:54:14.4323250Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.4323386Z env: 2025-12-04T08:54:14.4323467Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.4323564Z ##[endgroup] 2025-12-04T08:54:14.4388539Z ii rocm-cmake 0.14.0.60401-83~22.04 amd64 rocm-cmake built using CMake 2025-12-04T08:54:14.4389217Z ii rocm-core 6.4.1.60401-83~22.04 amd64 ROCm Runtime software stack 2025-12-04T08:54:14.4389464Z ii rocm-dbgapi 0.77.2.60401-83~22.04 amd64 Library to provide AMD GPU debugger API 2025-12-04T08:54:14.4389719Z ii rocm-debug-agent 2.0.4.60401-83~22.04 amd64 Radeon Open Compute Debug Agent (ROCdebug-agent) 2025-12-04T08:54:14.4389975Z ii rocm-dev 6.4.1.60401-83~22.04 amd64 Radeon Open Compute (ROCm) Runtime software stack 2025-12-04T08:54:14.4390222Z ii rocm-device-libs 1.0.0.60401-83~22.04 amd64 Radeon Open Compute - device libraries 2025-12-04T08:54:14.4390437Z ii rocm-gdb 15.2.60401-83~22.04 amd64 ROCgdb 2025-12-04T08:54:14.4390639Z ii rocm-llvm 19.0.0.25184.60401-83~22.04 amd64 ROCm core compiler 2025-12-04T08:54:14.4390859Z ii rocm-opencl 2.0.0.60401-83~22.04 amd64 clr built using CMake 2025-12-04T08:54:14.4391066Z ii rocm-opencl-dev 2.0.0.60401-83~22.04 amd64 clr built using CMake 2025-12-04T08:54:14.4391286Z ii rocm-smi-lib 7.5.0.60401-83~22.04 amd64 AMD System Management libraries 2025-12-04T08:54:14.4391517Z ii rocm-utils 6.4.1.60401-83~22.04 amd64 Radeon Open Compute (ROCm) Runtime software stack 2025-12-04T08:54:14.4391756Z ii rocminfo 1.0.0.60401-83~22.04 amd64 Radeon Open Compute (ROCm) Runtime rocminfo tool 2025-12-04T08:54:14.4446759Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T08:54:14.4447097Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T08:54:14.4447308Z # shellcheck disable=SC2046 
2025-12-04T08:54:14.4447457Z docker stop $(docker ps -q) || true 2025-12-04T08:54:14.4447594Z # Prune all stopped containers. 2025-12-04T08:54:14.4447728Z docker container prune -f 2025-12-04T08:54:14.4452118Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.4452276Z env: 2025-12-04T08:54:14.4452367Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.4452474Z ##[endgroup] 2025-12-04T08:54:14.4766638Z docker: 'docker stop' requires at least 1 argument 2025-12-04T08:54:14.4767334Z 2025-12-04T08:54:14.4767410Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 2025-12-04T08:54:14.4767512Z 2025-12-04T08:54:14.4767707Z See 'docker stop --help' for more information 2025-12-04T08:54:14.4859039Z Total reclaimed space: 0B 2025-12-04T08:54:14.4895807Z ##[group]Run cat /etc/os-release || true 2025-12-04T08:54:14.4896082Z cat /etc/os-release || true 2025-12-04T08:54:14.4896239Z cat /etc/apt/sources.list.d/rocm.list || true 2025-12-04T08:54:14.4896560Z cat /opt/rocm/.info/version || true 2025-12-04T08:54:14.4896678Z whoami 2025-12-04T08:54:14.4900946Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.4901097Z env: 2025-12-04T08:54:14.4901190Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.4901295Z ##[endgroup] 2025-12-04T08:54:14.4963744Z PRETTY_NAME="Ubuntu 22.04.5 LTS" 2025-12-04T08:54:14.4963860Z NAME="Ubuntu" 2025-12-04T08:54:14.4963960Z VERSION_ID="22.04" 2025-12-04T08:54:14.4964076Z VERSION="22.04.5 LTS (Jammy Jellyfish)" 2025-12-04T08:54:14.4964201Z VERSION_CODENAME=jammy 2025-12-04T08:54:14.4964303Z ID=ubuntu 2025-12-04T08:54:14.4964385Z ID_LIKE=debian 2025-12-04T08:54:14.4964511Z HOME_URL="https://www.ubuntu.com/" 2025-12-04T08:54:14.4964645Z SUPPORT_URL="https://help.ubuntu.com/" 2025-12-04T08:54:14.4964793Z BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" 2025-12-04T08:54:14.4965009Z PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" 2025-12-04T08:54:14.4965372Z 
UBUNTU_CODENAME=jammy 2025-12-04T08:54:14.4995240Z deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/6.4.1 jammy main 2025-12-04T08:54:14.5029282Z 6.4.1-83 2025-12-04T08:54:14.5035289Z runner 2025-12-04T08:54:14.5055579Z ##[group]Run dpkg -l | grep -E " amdgpu" 2025-12-04T08:54:14.5055743Z dpkg -l | grep -E " amdgpu" 2025-12-04T08:54:14.5059144Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.5059292Z env: 2025-12-04T08:54:14.5059381Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.5059494Z ##[endgroup] 2025-12-04T08:54:14.5115229Z ii amdgpu-core 1:6.4.60401-2164967.22.04 all Core meta package for unified amdgpu driver. 2025-12-04T08:54:14.5116448Z ii amdgpu-install 6.4.60401-2164967.22.04 all AMDGPU driver repository and installer 2025-12-04T08:54:14.5163412Z ##[group]Run rocm-smi 2025-12-04T08:54:14.5163550Z rocm-smi 2025-12-04T08:54:14.5167144Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.5167294Z env: 2025-12-04T08:54:14.5167406Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.5167516Z ##[endgroup] 2025-12-04T08:54:14.5754338Z 2025-12-04T08:54:14.5754349Z 2025-12-04T08:54:14.5754509Z =========================================== ROCm System Management Interface =========================================== 2025-12-04T08:54:14.5754727Z ===================================================== Concise Info ===================================================== 2025-12-04T08:54:14.5754945Z Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU% 2025-12-04T08:54:14.5755490Z (DID, GUID) (Junction) (Socket) (Mem, Compute, ID) 2025-12-04T08:54:14.5755681Z ======================================================================================================================== 2025-12-04T08:54:14.5756492Z 0 4 0x74a5, 61326 27.0°C 132.0W NPS1, SPX, 0 N/A 900Mhz 0% auto 1000.0W 0% 0% 2025-12-04T08:54:14.5756680Z 
======================================================================================================================== 2025-12-04T08:54:14.5756841Z ================================================= End of ROCm SMI Log ================================================== 2025-12-04T08:54:14.5803394Z ##[group]Run rocminfo 2025-12-04T08:54:14.5803530Z rocminfo 2025-12-04T08:54:14.5807776Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.5807925Z env: 2025-12-04T08:54:14.5808011Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.5808114Z ##[endgroup] 2025-12-04T08:54:14.6422672Z ROCk module version 6.12.12 is loaded 2025-12-04T08:54:14.6424714Z ===================== 2025-12-04T08:54:14.6424867Z HSA System Attributes 2025-12-04T08:54:14.6424962Z ===================== 2025-12-04T08:54:14.6425294Z Runtime Version: 1.15 2025-12-04T08:54:14.6425405Z Runtime Ext Version: 1.7 2025-12-04T08:54:14.6425513Z System Timestamp Freq.: 1000.000000MHz 2025-12-04T08:54:14.6425773Z Sig. 
Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) 2025-12-04T08:54:14.6425991Z Machine Model: LARGE 2025-12-04T08:54:14.6426142Z System Endianness: LITTLE 2025-12-04T08:54:14.6426277Z Mwaitx: DISABLED 2025-12-04T08:54:14.6426385Z XNACK enabled: NO 2025-12-04T08:54:14.6426490Z DMAbuf Support: YES 2025-12-04T08:54:14.6426598Z VMM Support: YES 2025-12-04T08:54:14.6426664Z 2025-12-04T08:54:14.6426698Z ========== 2025-12-04T08:54:14.6426794Z HSA Agents 2025-12-04T08:54:14.6426883Z ========== 2025-12-04T08:54:14.6426974Z ******* 2025-12-04T08:54:14.6427196Z Agent 1 2025-12-04T08:54:14.6427289Z ******* 2025-12-04T08:54:14.6427407Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T08:54:14.6429110Z Uuid: CPU-XX 2025-12-04T08:54:14.6429257Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T08:54:14.6429409Z Vendor Name: CPU 2025-12-04T08:54:14.6429551Z Feature: None specified 2025-12-04T08:54:14.6429696Z Profile: FULL_PROFILE 2025-12-04T08:54:14.6429842Z Float Round Mode: NEAR 2025-12-04T08:54:14.6429989Z Max Queue Number: 0(0x0) 2025-12-04T08:54:14.6430132Z Queue Min Size: 0(0x0) 2025-12-04T08:54:14.6430276Z Queue Max Size: 0(0x0) 2025-12-04T08:54:14.6430420Z Queue Type: MULTI 2025-12-04T08:54:14.6430555Z Node: 0 2025-12-04T08:54:14.6430688Z Device Type: CPU 2025-12-04T08:54:14.6430818Z Cache Info: 2025-12-04T08:54:14.6430929Z L1: 49152(0xc000) KB 2025-12-04T08:54:14.6431062Z Chip ID: 0(0x0) 2025-12-04T08:54:14.6431200Z ASIC Revision: 0(0x0) 2025-12-04T08:54:14.6431347Z Cacheline Size: 64(0x40) 2025-12-04T08:54:14.6431493Z Max Clock Freq. (MHz): 3300 2025-12-04T08:54:14.6431630Z BDFID: 0 2025-12-04T08:54:14.6431772Z Internal Node ID: 0 2025-12-04T08:54:14.6431918Z Compute Unit: 64 2025-12-04T08:54:14.6432061Z SIMDs per CU: 0 2025-12-04T08:54:14.6432201Z Shader Engines: 0 2025-12-04T08:54:14.6432353Z Shader Arrs. per Eng.: 0 2025-12-04T08:54:14.6432507Z WatchPts on Addr. 
Ranges:1 2025-12-04T08:54:14.6432640Z Memory Properties: 2025-12-04T08:54:14.6432746Z Features: None 2025-12-04T08:54:14.6432848Z Pool Info: 2025-12-04T08:54:14.6432946Z Pool 1 2025-12-04T08:54:14.6433075Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T08:54:14.6433220Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T08:54:14.6433365Z Allocatable: TRUE 2025-12-04T08:54:14.6433514Z Alloc Granule: 4KB 2025-12-04T08:54:14.6433730Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6433891Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6434043Z Accessible by all: TRUE 2025-12-04T08:54:14.6434171Z Pool 2 2025-12-04T08:54:14.6434297Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T08:54:14.6434438Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T08:54:14.6434580Z Allocatable: TRUE 2025-12-04T08:54:14.6434728Z Alloc Granule: 4KB 2025-12-04T08:54:14.6434881Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6435036Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6435189Z Accessible by all: TRUE 2025-12-04T08:54:14.6435351Z Pool 3 2025-12-04T08:54:14.6435478Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T08:54:14.6435620Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T08:54:14.6435761Z Allocatable: TRUE 2025-12-04T08:54:14.6435908Z Alloc Granule: 4KB 2025-12-04T08:54:14.6436103Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6436259Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6436410Z Accessible by all: TRUE 2025-12-04T08:54:14.6436540Z Pool 4 2025-12-04T08:54:14.6436661Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T08:54:14.6436800Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T08:54:14.6436941Z Allocatable: TRUE 2025-12-04T08:54:14.6437090Z Alloc Granule: 4KB 2025-12-04T08:54:14.6437242Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6437396Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6437548Z Accessible by all: TRUE 2025-12-04T08:54:14.6437677Z ISA Info: 2025-12-04T08:54:14.6437773Z ******* 2025-12-04T08:54:14.6437864Z Agent 2 2025-12-04T08:54:14.6437955Z ******* 
2025-12-04T08:54:14.6438067Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T08:54:14.6438253Z Uuid: CPU-XX 2025-12-04T08:54:14.6438399Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T08:54:14.6438551Z Vendor Name: CPU 2025-12-04T08:54:14.6438698Z Feature: None specified 2025-12-04T08:54:14.6438842Z Profile: FULL_PROFILE 2025-12-04T08:54:14.6438986Z Float Round Mode: NEAR 2025-12-04T08:54:14.6439135Z Max Queue Number: 0(0x0) 2025-12-04T08:54:14.6439278Z Queue Min Size: 0(0x0) 2025-12-04T08:54:14.6439418Z Queue Max Size: 0(0x0) 2025-12-04T08:54:14.6439562Z Queue Type: MULTI 2025-12-04T08:54:14.6439699Z Node: 1 2025-12-04T08:54:14.6439836Z Device Type: CPU 2025-12-04T08:54:14.6439962Z Cache Info: 2025-12-04T08:54:14.6440072Z L1: 49152(0xc000) KB 2025-12-04T08:54:14.6440203Z Chip ID: 0(0x0) 2025-12-04T08:54:14.6440412Z ASIC Revision: 0(0x0) 2025-12-04T08:54:14.6440557Z Cacheline Size: 64(0x40) 2025-12-04T08:54:14.6440706Z Max Clock Freq. (MHz): 3300 2025-12-04T08:54:14.6440861Z BDFID: 0 2025-12-04T08:54:14.6441001Z Internal Node ID: 1 2025-12-04T08:54:14.6441144Z Compute Unit: 64 2025-12-04T08:54:14.6441284Z SIMDs per CU: 0 2025-12-04T08:54:14.6441424Z Shader Engines: 0 2025-12-04T08:54:14.6441572Z Shader Arrs. per Eng.: 0 2025-12-04T08:54:14.6441723Z WatchPts on Addr. 
Ranges:1 2025-12-04T08:54:14.6441894Z Memory Properties: 2025-12-04T08:54:14.6441999Z Features: None 2025-12-04T08:54:14.6442099Z Pool Info: 2025-12-04T08:54:14.6442199Z Pool 1 2025-12-04T08:54:14.6442323Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T08:54:14.6442462Z Size: 1585311812(0x5e7df044) KB 2025-12-04T08:54:14.6442605Z Allocatable: TRUE 2025-12-04T08:54:14.6442755Z Alloc Granule: 4KB 2025-12-04T08:54:14.6442911Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6443069Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6443226Z Accessible by all: TRUE 2025-12-04T08:54:14.6443359Z Pool 2 2025-12-04T08:54:14.6443488Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T08:54:14.6443634Z Size: 1585311812(0x5e7df044) KB 2025-12-04T08:54:14.6443779Z Allocatable: TRUE 2025-12-04T08:54:14.6443933Z Alloc Granule: 4KB 2025-12-04T08:54:14.6444087Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6444248Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6444408Z Accessible by all: TRUE 2025-12-04T08:54:14.6444538Z Pool 3 2025-12-04T08:54:14.6444665Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T08:54:14.6444807Z Size: 1585311812(0x5e7df044) KB 2025-12-04T08:54:14.6444952Z Allocatable: TRUE 2025-12-04T08:54:14.6445105Z Alloc Granule: 4KB 2025-12-04T08:54:14.6445266Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6445426Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6445584Z Accessible by all: TRUE 2025-12-04T08:54:14.6445716Z Pool 4 2025-12-04T08:54:14.6445842Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T08:54:14.6446159Z Size: 1585311812(0x5e7df044) KB 2025-12-04T08:54:14.6446304Z Allocatable: TRUE 2025-12-04T08:54:14.6446455Z Alloc Granule: 4KB 2025-12-04T08:54:14.6446608Z Alloc Recommended Granule:4KB 2025-12-04T08:54:14.6446768Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6446923Z Accessible by all: TRUE 2025-12-04T08:54:14.6447057Z ISA Info: 2025-12-04T08:54:14.6447213Z ******* 2025-12-04T08:54:14.6447309Z Agent 3 2025-12-04T08:54:14.6447406Z ******* 
2025-12-04T08:54:14.6447518Z Name: gfx942 2025-12-04T08:54:14.6447657Z Uuid: GPU-01ff8763ec76c341 2025-12-04T08:54:14.6447809Z Marketing Name: AMD Instinct MI325X 2025-12-04T08:54:14.6447966Z Vendor Name: AMD 2025-12-04T08:54:14.6448112Z Feature: KERNEL_DISPATCH 2025-12-04T08:54:14.6448262Z Profile: BASE_PROFILE 2025-12-04T08:54:14.6448409Z Float Round Mode: NEAR 2025-12-04T08:54:14.6448563Z Max Queue Number: 128(0x80) 2025-12-04T08:54:14.6448765Z Queue Min Size: 64(0x40) 2025-12-04T08:54:14.6448960Z Queue Max Size: 131072(0x20000) 2025-12-04T08:54:14.6449110Z Queue Type: MULTI 2025-12-04T08:54:14.6449252Z Node: 2 2025-12-04T08:54:14.6449389Z Device Type: GPU 2025-12-04T08:54:14.6449523Z Cache Info: 2025-12-04T08:54:14.6449637Z L1: 32(0x20) KB 2025-12-04T08:54:14.6449765Z L2: 4096(0x1000) KB 2025-12-04T08:54:14.6449896Z L3: 262144(0x40000) KB 2025-12-04T08:54:14.6450026Z Chip ID: 29861(0x74a5) 2025-12-04T08:54:14.6450172Z ASIC Revision: 1(0x1) 2025-12-04T08:54:14.6450323Z Cacheline Size: 128(0x80) 2025-12-04T08:54:14.6450474Z Max Clock Freq. (MHz): 2100 2025-12-04T08:54:14.6450620Z BDFID: 25856 2025-12-04T08:54:14.6450766Z Internal Node ID: 2 2025-12-04T08:54:14.6450913Z Compute Unit: 304 2025-12-04T08:54:14.6451059Z SIMDs per CU: 4 2025-12-04T08:54:14.6451206Z Shader Engines: 32 2025-12-04T08:54:14.6451360Z Shader Arrs. per Eng.: 1 2025-12-04T08:54:14.6451519Z WatchPts on Addr. 
Ranges:4 2025-12-04T08:54:14.6451675Z Coherent Host Access: FALSE 2025-12-04T08:54:14.6451815Z Memory Properties: 2025-12-04T08:54:14.6451932Z Features: KERNEL_DISPATCH 2025-12-04T08:54:14.6452073Z Fast F16 Operation: TRUE 2025-12-04T08:54:14.6452235Z Wavefront Size: 64(0x40) 2025-12-04T08:54:14.6452385Z Workgroup Max Size: 1024(0x400) 2025-12-04T08:54:14.6452531Z Workgroup Max Size per Dimension: 2025-12-04T08:54:14.6452659Z x 1024(0x400) 2025-12-04T08:54:14.6452783Z y 1024(0x400) 2025-12-04T08:54:14.6452910Z z 1024(0x400) 2025-12-04T08:54:14.6453049Z Max Waves Per CU: 32(0x20) 2025-12-04T08:54:14.6453199Z Max Work-item Per CU: 2048(0x800) 2025-12-04T08:54:14.6453351Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T08:54:14.6453481Z Grid Max Size per Dimension: 2025-12-04T08:54:14.6453597Z x 4294967295(0xffffffff) 2025-12-04T08:54:14.6453725Z y 4294967295(0xffffffff) 2025-12-04T08:54:14.6453881Z z 4294967295(0xffffffff) 2025-12-04T08:54:14.6454028Z Max fbarriers/Workgrp: 32 2025-12-04T08:54:14.6459623Z Packet Processor uCode:: 185 2025-12-04T08:54:14.6459789Z SDMA engine uCode:: 24 2025-12-04T08:54:14.6459948Z IOMMU Support:: None 2025-12-04T08:54:14.6460087Z Pool Info: 2025-12-04T08:54:14.6460191Z Pool 1 2025-12-04T08:54:14.6460326Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T08:54:14.6460475Z Size: 268419072(0xfffc000) KB 2025-12-04T08:54:14.6460629Z Allocatable: TRUE 2025-12-04T08:54:14.6460786Z Alloc Granule: 4KB 2025-12-04T08:54:14.6461031Z Alloc Recommended Granule:2048KB 2025-12-04T08:54:14.6461201Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6461362Z Accessible by all: FALSE 2025-12-04T08:54:14.6461497Z Pool 2 2025-12-04T08:54:14.6461631Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T08:54:14.6461777Z Size: 268419072(0xfffc000) KB 2025-12-04T08:54:14.6461927Z Allocatable: TRUE 2025-12-04T08:54:14.6462082Z Alloc Granule: 4KB 2025-12-04T08:54:14.6462239Z Alloc Recommended Granule:2048KB 2025-12-04T08:54:14.6462400Z Alloc Alignment: 4KB 
2025-12-04T08:54:14.6462579Z Accessible by all: FALSE 2025-12-04T08:54:14.6462715Z Pool 3 2025-12-04T08:54:14.6462849Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T08:54:14.6462996Z Size: 268419072(0xfffc000) KB 2025-12-04T08:54:14.6463138Z Allocatable: TRUE 2025-12-04T08:54:14.6463294Z Alloc Granule: 4KB 2025-12-04T08:54:14.6463456Z Alloc Recommended Granule:2048KB 2025-12-04T08:54:14.6463613Z Alloc Alignment: 4KB 2025-12-04T08:54:14.6463773Z Accessible by all: FALSE 2025-12-04T08:54:14.6463904Z Pool 4 2025-12-04T08:54:14.6464027Z Segment: GROUP 2025-12-04T08:54:14.6464170Z Size: 64(0x40) KB 2025-12-04T08:54:14.6464313Z Allocatable: FALSE 2025-12-04T08:54:14.6464471Z Alloc Granule: 0KB 2025-12-04T08:54:14.6464634Z Alloc Recommended Granule:0KB 2025-12-04T08:54:14.6464791Z Alloc Alignment: 0KB 2025-12-04T08:54:14.6464950Z Accessible by all: FALSE 2025-12-04T08:54:14.6465089Z ISA Info: 2025-12-04T08:54:14.6465190Z ISA 1 2025-12-04T08:54:14.6465323Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T08:54:14.6465488Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T08:54:14.6465650Z Profiles: HSA_PROFILE_BASE 2025-12-04T08:54:14.6465809Z Default Rounding Mode: NEAR 2025-12-04T08:54:14.6465985Z Default Rounding Mode: NEAR 2025-12-04T08:54:14.6466188Z Fast f16: TRUE 2025-12-04T08:54:14.6466344Z Workgroup Max Size: 1024(0x400) 2025-12-04T08:54:14.6466487Z Workgroup Max Size per Dimension: 2025-12-04T08:54:14.6466619Z x 1024(0x400) 2025-12-04T08:54:14.6466749Z y 1024(0x400) 2025-12-04T08:54:14.6466878Z z 1024(0x400) 2025-12-04T08:54:14.6467020Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T08:54:14.6467155Z Grid Max Size per Dimension: 2025-12-04T08:54:14.6467279Z x 4294967295(0xffffffff) 2025-12-04T08:54:14.6467412Z y 4294967295(0xffffffff) 2025-12-04T08:54:14.6467540Z z 4294967295(0xffffffff) 2025-12-04T08:54:14.6467734Z FBarrier Max Size: 32 2025-12-04T08:54:14.6467868Z ISA 2 2025-12-04T08:54:14.6468010Z Name: 
amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 2025-12-04T08:54:14.6468185Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T08:54:14.6468342Z Profiles: HSA_PROFILE_BASE 2025-12-04T08:54:14.6468501Z Default Rounding Mode: NEAR 2025-12-04T08:54:14.6468664Z Default Rounding Mode: NEAR 2025-12-04T08:54:14.6468813Z Fast f16: TRUE 2025-12-04T08:54:14.6468965Z Workgroup Max Size: 1024(0x400) 2025-12-04T08:54:14.6469111Z Workgroup Max Size per Dimension: 2025-12-04T08:54:14.6469235Z x 1024(0x400) 2025-12-04T08:54:14.6469366Z y 1024(0x400) 2025-12-04T08:54:14.6469491Z z 1024(0x400) 2025-12-04T08:54:14.6469634Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T08:54:14.6469820Z Grid Max Size per Dimension: 2025-12-04T08:54:14.6469937Z x 4294967295(0xffffffff) 2025-12-04T08:54:14.6470070Z y 4294967295(0xffffffff) 2025-12-04T08:54:14.6470201Z z 4294967295(0xffffffff) 2025-12-04T08:54:14.6470341Z FBarrier Max Size: 32 2025-12-04T08:54:14.6470476Z *** Done *** 2025-12-04T08:54:14.6537464Z ##[group]Run ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') 2025-12-04T08:54:14.6537691Z ngpu=$(rocminfo | grep -c -E 'Name:.*\sgfx') 2025-12-04T08:54:14.6537987Z msg="Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" 2025-12-04T08:54:14.6538267Z if [[ $ngpu -eq 0 ]]; then 2025-12-04T08:54:14.6538419Z  echo "Error: Failed to detect any GPUs on the runner" 2025-12-04T08:54:14.6538570Z  echo "$msg" 2025-12-04T08:54:14.6538670Z  exit 1 2025-12-04T08:54:14.6538770Z fi 2025-12-04T08:54:14.6543288Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.6543447Z env: 2025-12-04T08:54:14.6543552Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.6543660Z ##[endgroup] 2025-12-04T08:54:14.7256873Z ##[group]Run pytorch/pytorch/.github/actions/diskspace-cleanup@main 2025-12-04T08:54:14.7257034Z with: 2025-12-04T08:54:14.7257124Z diskspace-cutoff: 70 2025-12-04T08:54:14.7257219Z env: 2025-12-04T08:54:14.7257305Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.7257407Z ##[endgroup] 2025-12-04T08:54:14.7279830Z ##[group]Run set -ex 2025-12-04T08:54:14.7279975Z set -ex 2025-12-04T08:54:14.7280074Z diskspace_cutoff=70 2025-12-04T08:54:14.7280302Z docker_root_dir=$(docker info -f '{{.DockerRootDir}}') 2025-12-04T08:54:14.7280456Z if [ ! -d "$docker_root_dir" ]; then 2025-12-04T08:54:14.7280651Z  echo "Docker root directory ($docker_root_dir) does not exist. Skipping disk space check." 2025-12-04T08:54:14.7280831Z  exit 0 2025-12-04T08:54:14.7280916Z fi 2025-12-04T08:54:14.7281075Z diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //') 2025-12-04T08:54:14.7281394Z msg="Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified" 2025-12-04T08:54:14.7281670Z if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then 2025-12-04T08:54:14.7281816Z  docker system prune -af 2025-12-04T08:54:14.7281997Z  diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //') 2025-12-04T08:54:14.7282258Z  if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then 2025-12-04T08:54:14.7282418Z  diskspace_cutoff_int=$((diskspace_cutoff + 0)) 2025-12-04T08:54:14.7282568Z  difference=$((100 - diskspace_cutoff_int)) 2025-12-04T08:54:14.7282768Z  echo "Error: Available diskspace is less than $difference percent. Not enough diskspace." 2025-12-04T08:54:14.7282950Z  echo "$msg" 2025-12-04T08:54:14.7283051Z  exit 1 2025-12-04T08:54:14.7283145Z  else 2025-12-04T08:54:14.7283255Z  difference=$((diskspace - diskspace_new)) 2025-12-04T08:54:14.7283404Z  echo "Diskspace saved: $difference percent" 2025-12-04T08:54:14.7283530Z  fi 2025-12-04T08:54:14.7283616Z fi 2025-12-04T08:54:14.7287711Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.7287854Z env: 2025-12-04T08:54:14.7287944Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.7288045Z ##[endgroup] 2025-12-04T08:54:14.7310647Z + diskspace_cutoff=70 2025-12-04T08:54:14.7310880Z ++ docker info -f '{{.DockerRootDir}}' 2025-12-04T08:54:14.7858064Z + docker_root_dir=/home/runner/docker-data 2025-12-04T08:54:14.7858647Z + '[' '!' -d /home/runner/docker-data ']' 2025-12-04T08:54:14.7858845Z ++ df -H --output=pcent /home/runner/docker-data 2025-12-04T08:54:14.7862154Z ++ sed -n 2p 2025-12-04T08:54:14.7862395Z ++ sed s/%// 2025-12-04T08:54:14.7862501Z ++ sed 's/ //' 2025-12-04T08:54:14.7874375Z + diskspace=' 5' 2025-12-04T08:54:14.7874652Z + msg='Please file an issue on pytorch/pytorch reporting the faulty runner. 
Include a link to the runner logs so the runner can be identified' 2025-12-04T08:54:14.7874907Z + [[ 5 -ge 70 ]] 2025-12-04T08:54:14.7897918Z ##[group]Run RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts" 2025-12-04T08:54:14.7898155Z RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts" 2025-12-04T08:54:14.7898339Z rm -rf "${RUNNER_ARTIFACT_DIR}" 2025-12-04T08:54:14.7898481Z mkdir -p "${RUNNER_ARTIFACT_DIR}" 2025-12-04T08:54:14.7898652Z echo "RUNNER_ARTIFACT_DIR=${RUNNER_ARTIFACT_DIR}" >> "${GITHUB_ENV}" 2025-12-04T08:54:14.7898808Z  2025-12-04T08:54:14.7898925Z RUNNER_TEST_RESULTS_DIR="${RUNNER_TEMP}/test-results" 2025-12-04T08:54:14.7899079Z rm -rf "${RUNNER_TEST_RESULTS_DIR}" 2025-12-04T08:54:14.7899207Z mkdir -p "${RUNNER_TEST_RESULTS_DIR}" 2025-12-04T08:54:14.7899380Z echo "RUNNER_TEST_RESULTS_DIR=${RUNNER_TEST_RESULTS_DIR}" >> "${GITHUB_ENV}" 2025-12-04T08:54:14.7899538Z  2025-12-04T08:54:14.7899630Z RUNNER_DOCS_DIR="${RUNNER_TEMP}/docs" 2025-12-04T08:54:14.7899757Z rm -rf "${RUNNER_DOCS_DIR}" 2025-12-04T08:54:14.7899872Z mkdir -p "${RUNNER_DOCS_DIR}" 2025-12-04T08:54:14.7900021Z echo "RUNNER_DOCS_DIR=${RUNNER_DOCS_DIR}" >> "${GITHUB_ENV}" 2025-12-04T08:54:14.7904725Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.7904872Z env: 2025-12-04T08:54:14.7904960Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.7905058Z ##[endgroup] 2025-12-04T08:54:14.8037772Z ##[group]Run env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:14.8037995Z env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:14.8038180Z env | grep '^CI' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:14.8041588Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.8041740Z env: 2025-12-04T08:54:14.8041840Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.8041978Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:14.8042152Z RUNNER_TEST_RESULTS_DIR: 
/home/runner/_work/_temp/test-results 2025-12-04T08:54:14.8042318Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:14.8042600Z ##[endgroup] 2025-12-04T08:54:14.8091266Z ##[group]Run # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py. 2025-12-04T08:54:14.8091557Z # All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py. 2025-12-04T08:54:14.8091752Z # Add render group for container creation. 2025-12-04T08:54:14.8091918Z render_gid=`cat /etc/group | grep render | cut -d: -f3` 2025-12-04T08:54:14.8092113Z # Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG. 2025-12-04T08:54:14.8092309Z if [ -f "/etc/podinfo/gha-render-devices" ]; then 2025-12-04T08:54:14.8092467Z  DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices) 2025-12-04T08:54:14.8092602Z else 2025-12-04T08:54:14.8092700Z  DEVICE_FLAG="--device /dev/dri" 2025-12-04T08:54:14.8092810Z fi 2025-12-04T08:54:14.8092987Z # The --group-add daemon and --group-add bin are needed in the Ubuntu 24.04 and Almalinux OSs respectively. 2025-12-04T08:54:14.8093279Z # This is due to the device files (/dev/kfd & /dev/dri) being owned by video group on bare metal. 2025-12-04T08:54:14.8093526Z # This video group ID maps to subgid 1 inside the docker image due to the /etc/subgid entries. 2025-12-04T08:54:14.8093788Z # The group name corresponding to group ID 1 can change depending on the OS, so both are necessary. 
2025-12-04T08:54:14.8094227Z echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host" >> "${GITHUB_ENV}" 2025-12-04T08:54:14.8097542Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:14.8097679Z env: 2025-12-04T08:54:14.8097769Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.8097900Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:14.8098074Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:14.8098237Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:14.8098360Z ##[endgroup] 2025-12-04T08:54:14.8215265Z ##[group]Run aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 2025-12-04T08:54:14.8215491Z with: 2025-12-04T08:54:14.8215640Z role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only 2025-12-04T08:54:14.8215810Z aws-region: us-east-1 2025-12-04T08:54:14.8216212Z role-duration-seconds: 18000 2025-12-04T08:54:14.8216341Z audience: sts.amazonaws.com 2025-12-04T08:54:14.8216456Z env: 2025-12-04T08:54:14.8216551Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:14.8216686Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:14.8216869Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:14.8217040Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:14.8217574Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:14.8217958Z ##[endgroup] 2025-12-04T08:54:15.1520245Z Assuming role with OIDC 2025-12-04T08:54:15.4937639Z Authenticated as assumedRoleId AROAUPVRELQNLLCOPFEJR:GitHubActions 2025-12-04T08:54:15.5981797Z ##[group]Run 
aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 2025-12-04T08:54:15.5981997Z with: 2025-12-04T08:54:15.5982096Z mask-password: true 2025-12-04T08:54:15.5982205Z registry-type: private 2025-12-04T08:54:15.5982312Z skip-logout: false 2025-12-04T08:54:15.5982409Z env: 2025-12-04T08:54:15.5982502Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:15.5982637Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:15.5982813Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:15.5982978Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:15.5983497Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:15.5983871Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:15.5983986Z AWS_REGION: us-east-1 2025-12-04T08:54:15.5984360Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:15.5984514Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:15.5986658Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:15.5986761Z ##[endgroup] 2025-12-04T08:54:16.0085355Z Logging into registry 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.6596064Z ##[group]Run env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:16.6596311Z env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:16.6596504Z env | grep '^CI' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:16.6596717Z env | grep '^RUNNER' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" 2025-12-04T08:54:16.6601339Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:16.6601490Z env: 2025-12-04T08:54:16.6601587Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:16.6601727Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:16.6601908Z RUNNER_TEST_RESULTS_DIR: 
/home/runner/_work/_temp/test-results 2025-12-04T08:54:16.6602081Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:16.6602467Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:16.6602842Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:16.6602962Z AWS_REGION: us-east-1 2025-12-04T08:54:16.6603157Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:16.6603312Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:16.6605419Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:16.6605528Z ##[endgroup] 2025-12-04T08:54:16.6788642Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2025-12-04T08:54:16.6788837Z with: 2025-12-04T08:54:16.6789121Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6789430Z use-custom-docker-registry: true 2025-12-04T08:54:16.6789564Z docker-build-dir: .ci/docker 2025-12-04T08:54:16.6789692Z docker-build-script: ./build.sh 2025-12-04T08:54:16.6789823Z working-directory: . 
2025-12-04T08:54:16.6789971Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.6790134Z force-push: false 2025-12-04T08:54:16.6790236Z env: 2025-12-04T08:54:16.6790334Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:16.6790480Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:16.6790682Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:16.6790870Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:16.6791262Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:16.6791641Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:16.6791761Z AWS_REGION: us-east-1 2025-12-04T08:54:16.6792001Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:16.6792157Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:16.6794233Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:16.6794342Z ##[endgroup] 2025-12-04T08:54:16.6803611Z ##[group]Run set -ex 2025-12-04T08:54:16.6803763Z set -ex 2025-12-04T08:54:16.6803864Z  2025-12-04T08:54:16.6804027Z # If the docker build directory or the build script doesn't exist, the action will 2025-12-04T08:54:16.6804414Z # gracefully return the docker image name as it is. 
Pulling docker image in Linux 2025-12-04T08:54:16.6804637Z # job could then download the pre-built image as usual 2025-12-04T08:54:16.6804904Z if [[ -d "${DOCKER_BUILD_DIR}" ]] && [[ -f "${DOCKER_BUILD_DIR}/${DOCKER_BUILD_SCRIPT}" ]] && [[ "${USE_CUSTOM_DOCKER_REGISTRY}" == "true" ]]; then 2025-12-04T08:54:16.6805148Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6805283Z else 2025-12-04T08:54:16.6805395Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6805571Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6805727Z  2025-12-04T08:54:16.6805994Z  echo "Not using custom ECR registry. Either it was not requested or there is no Docker build script in the ${REPO_NAME} repo..." 2025-12-04T08:54:16.6806234Z  exit 0 2025-12-04T08:54:16.6806333Z fi 2025-12-04T08:54:16.6806432Z  2025-12-04T08:54:16.6806578Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2025-12-04T08:54:16.6806809Z  # The docker image name already includes the ECR prefix and tag, so we can just 2025-12-04T08:54:16.6807018Z  # use it as it is, but first let's extract the tag 2025-12-04T08:54:16.6807211Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2025-12-04T08:54:16.6807411Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6807600Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6807756Z else 2025-12-04T08:54:16.6807873Z  if [[ "${DOCKER_IMAGE_NAME}" == *:* ]]; then 2025-12-04T08:54:16.6808029Z  CUSTOM_TAG_PREFIX=${DOCKER_IMAGE_NAME#*:} 2025-12-04T08:54:16.6808187Z  DOCKER_IMAGE_NAME=${DOCKER_IMAGE_NAME%%:*} 2025-12-04T08:54:16.6808322Z  fi 2025-12-04T08:54:16.6808606Z  DOCKER_TAG=${CUSTOM_TAG_PREFIX:+${CUSTOM_TAG_PREFIX}-}$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2025-12-04T08:54:16.6808839Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6809080Z  echo 
"docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6809343Z  echo "custom-tag-prefix=${CUSTOM_TAG_PREFIX}" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6809509Z fi 2025-12-04T08:54:16.6813889Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:16.6814040Z env: 2025-12-04T08:54:16.6814141Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:16.6814282Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:16.6814463Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:16.6814634Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:16.6815023Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:16.6815404Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:16.6815526Z AWS_REGION: us-east-1 2025-12-04T08:54:16.6815687Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:16.6815844Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:16.6817937Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:16.6818050Z REPO_NAME: pytorch 2025-12-04T08:54:16.6818329Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6818624Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T08:54:16.6818746Z DOCKER_BUILD_SCRIPT: ./build.sh 2025-12-04T08:54:16.6818901Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.6819064Z USE_CUSTOM_DOCKER_REGISTRY: true 2025-12-04T08:54:16.6819188Z CUSTOM_TAG_PREFIX: 2025-12-04T08:54:16.6819363Z ##[endgroup] 2025-12-04T08:54:16.6842113Z + [[ -d .ci/docker ]] 2025-12-04T08:54:16.6842288Z + [[ -f .ci/docker/./build.sh ]] 2025-12-04T08:54:16.6842420Z + [[ true == \t\r\u\e ]] 2025-12-04T08:54:16.6842536Z + echo skip=false 
2025-12-04T08:54:16.6842915Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2025-12-04T08:54:16.6843465Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6843830Z ++ awk -F '[:,]' '{print $2}' 2025-12-04T08:54:16.6855192Z + DOCKER_TAG=pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6875889Z + echo docker-tag=pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6876462Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6905494Z ##[group]Run set +e 2025-12-04T08:54:16.6905667Z set +e 2025-12-04T08:54:16.6905774Z set -x 2025-12-04T08:54:16.6905880Z  2025-12-04T08:54:16.6906044Z login() { 2025-12-04T08:54:16.6906250Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T08:54:16.6906458Z } 2025-12-04T08:54:16.6906559Z  2025-12-04T08:54:16.6906661Z retry () { 2025-12-04T08:54:16.6906789Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T08:54:16.6906927Z } 2025-12-04T08:54:16.6907026Z  2025-12-04T08:54:16.6907132Z retry login "${DOCKER_REGISTRY}" 2025-12-04T08:54:16.6907260Z  2025-12-04T08:54:16.6907361Z START_TIME=$(date +%s) 2025-12-04T08:54:16.6907510Z # Wait up to 120 minutes 2025-12-04T08:54:16.6907834Z while [[ $(( $(date +%s) - 7200 )) -lt $START_TIME ]]; do 2025-12-04T08:54:16.6908032Z  # Check if image already exists, if it does then skip building it 2025-12-04T08:54:16.6908229Z  if docker manifest inspect "${DOCKER_IMAGE}"; then 2025-12-04T08:54:16.6908377Z  exit 0 2025-12-04T08:54:16.6908487Z  
fi 2025-12-04T08:54:16.6908586Z  2025-12-04T08:54:16.6908745Z  # NB: This flag is used by Docker build workflow to push the image to ECR, so we can 2025-12-04T08:54:16.6908995Z  # use this to differentiate between the Docker build and regular build jobs. For the 2025-12-04T08:54:16.6909243Z  # latter, it will wait for the Docker images to become available before continuing 2025-12-04T08:54:16.6909448Z  if [ "${DOCKER_PUSH:-false}" == "true" ]; then 2025-12-04T08:54:16.6909615Z  # It's a Docker build job, let's build the image 2025-12-04T08:54:16.6909764Z  break 2025-12-04T08:54:16.6909875Z  else 2025-12-04T08:54:16.6910020Z  # It's a regular build job, wait for the image to become available 2025-12-04T08:54:16.6910187Z  sleep 300 2025-12-04T08:54:16.6910299Z  fi 2025-12-04T08:54:16.6910400Z done 2025-12-04T08:54:16.6910499Z  2025-12-04T08:54:16.6910649Z # NB: This part requires a full checkout. Otherwise, the merge base will 2025-12-04T08:54:16.6910869Z # be empty. The default action would be to continue rebuild the image 2025-12-04T08:54:16.6911076Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2025-12-04T08:54:16.6911260Z  # if we're on the base branch then use the parent commit 2025-12-04T08:54:16.6911427Z  MERGE_BASE=$(git rev-parse HEAD~) 2025-12-04T08:54:16.6911559Z else 2025-12-04T08:54:16.6911695Z  # otherwise we're on a PR, so use the most recent base commit 2025-12-04T08:54:16.6912021Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2025-12-04T08:54:16.6912170Z fi 2025-12-04T08:54:16.6912266Z  2025-12-04T08:54:16.6912373Z if [[ -z "${MERGE_BASE}" ]]; then 2025-12-04T08:54:16.6912523Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6912660Z  2025-12-04T08:54:16.6912846Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 2025-12-04T08:54:16.6913055Z  exit 0 2025-12-04T08:54:16.6913158Z fi 2025-12-04T08:54:16.6913248Z  2025-12-04T08:54:16.6913381Z if ! 
git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2025-12-04T08:54:16.6913637Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2025-12-04T08:54:16.6913858Z  exit 1 2025-12-04T08:54:16.6913988Z fi 2025-12-04T08:54:16.6914081Z  2025-12-04T08:54:16.6914234Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2025-12-04T08:54:16.6914480Z # If no image exists but the hash is the same as the previous hash then we should error out here 2025-12-04T08:54:16.6914702Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2025-12-04T08:54:16.6914954Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2025-12-04T08:54:16.6915234Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2025-12-04T08:54:16.6915410Z fi 2025-12-04T08:54:16.6915502Z  2025-12-04T08:54:16.6915618Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T08:54:16.6920108Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:16.6920264Z env: 2025-12-04T08:54:16.6920365Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:16.6920514Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:16.6920760Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:16.6920933Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:16.6921322Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:16.6921700Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:16.6921824Z AWS_REGION: us-east-1 2025-12-04T08:54:16.6922064Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:16.6922225Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:16.6924229Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:16.6924347Z 
DOCKER_BUILD_DIR: .ci/docker 2025-12-04T08:54:16.6924495Z BASE_REVISION: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T08:54:16.6924815Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6925179Z DOCKER_TAG: pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:16.6925415Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.6925573Z DOCKER_PUSH: 2025-12-04T08:54:16.6925678Z ##[endgroup] 2025-12-04T08:54:16.6996265Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.6996455Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.6996629Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:54:16.6996857Z /home/runner/_work/_temp/fe4c237f-0665-4452-86f9-6e5c837c9c1e.sh: line 5: aws: command not found 2025-12-04T08:54:16.6997121Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:16.7036278Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:16.7053346Z + sleep 1 2025-12-04T08:54:17.7087316Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:17.7087643Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:54:17.7087880Z /home/runner/_work/_temp/fe4c237f-0665-4452-86f9-6e5c837c9c1e.sh: line 5: aws: command not found 2025-12-04T08:54:17.7088147Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:17.7148603Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:17.7164950Z + sleep 2 2025-12-04T08:54:19.7182070Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:19.7182430Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:54:19.7182678Z /home/runner/_work/_temp/fe4c237f-0665-4452-86f9-6e5c837c9c1e.sh: line 5: aws: 
command not found 2025-12-04T08:54:19.7183003Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:19.7267840Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:19.7282309Z ++ date +%s 2025-12-04T08:54:19.7293065Z + START_TIME=1764838459 2025-12-04T08:54:19.7295892Z ++ date +%s 2025-12-04T08:54:19.7303161Z + [[ 1764831259 -lt 1764838459 ]] 2025-12-04T08:54:19.7303501Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:21.1026081Z { 2025-12-04T08:54:21.1026294Z "schemaVersion": 2, 2025-12-04T08:54:21.1026524Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2025-12-04T08:54:21.1026712Z "config": { 2025-12-04T08:54:21.1026860Z "mediaType": "application/vnd.docker.container.image.v1+json", 2025-12-04T08:54:21.1027023Z "size": 30520, 2025-12-04T08:54:21.1027212Z "digest": "sha256:45252333063339f104d56e41f20304e9511ab21c7768e8d156b95ddf24a9dbe5" 2025-12-04T08:54:21.1027388Z }, 2025-12-04T08:54:21.1027482Z "layers": [ 2025-12-04T08:54:21.1027574Z { 2025-12-04T08:54:21.1027715Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1027922Z "size": 30447951, 2025-12-04T08:54:21.1028424Z "digest": "sha256:63e5bc7682b85ae57a1221210f64d62e7a90b0a30f19af4ca734b8242ae49d63" 2025-12-04T08:54:21.1028718Z }, 2025-12-04T08:54:21.1028816Z { 2025-12-04T08:54:21.1028952Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1029111Z "size": 1554, 2025-12-04T08:54:21.1029274Z "digest": "sha256:835841cca3b7e1464290cdb78e48773e03583413fbed852c3cc5165a392ea44d" 2025-12-04T08:54:21.1029451Z }, 2025-12-04T08:54:21.1029536Z { 2025-12-04T08:54:21.1029662Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1029826Z "size": 313275691, 2025-12-04T08:54:21.1030030Z + exit 0 
2025-12-04T08:54:21.1030184Z "digest": "sha256:aac69780afc8611a5f94a235792d39ae055249c8319ef43b78675998a9b2f825" 2025-12-04T08:54:21.1030357Z }, 2025-12-04T08:54:21.1030440Z { 2025-12-04T08:54:21.1030570Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1030733Z "size": 704, 2025-12-04T08:54:21.1030897Z "digest": "sha256:029495b23122c840ca0e52d487afa8d2c4dbf1991cd7f204ec3e434dcf947bf4" 2025-12-04T08:54:21.1031073Z }, 2025-12-04T08:54:21.1031160Z { 2025-12-04T08:54:21.1031291Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1031450Z "size": 1218, 2025-12-04T08:54:21.1031612Z "digest": "sha256:d0fb85b008332051a3f7c052721ef68bde404b46c23fa43ad040373bd367826c" 2025-12-04T08:54:21.1031789Z }, 2025-12-04T08:54:21.1031871Z { 2025-12-04T08:54:21.1032000Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1032158Z "size": 484, 2025-12-04T08:54:21.1032326Z "digest": "sha256:59b63930883363c7d2aaab27cc61555d9f3e119dc18247a8624c98ebdaa354a5" 2025-12-04T08:54:21.1032661Z }, 2025-12-04T08:54:21.1032749Z { 2025-12-04T08:54:21.1032880Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1033177Z "size": 110363202, 2025-12-04T08:54:21.1033349Z "digest": "sha256:dc112c89d57aa1e85082e40a56e5bc743d64f834ae2f98afe91f60c248354d38" 2025-12-04T08:54:21.1033524Z }, 2025-12-04T08:54:21.1033602Z { 2025-12-04T08:54:21.1033731Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1033889Z "size": 4436, 2025-12-04T08:54:21.1034049Z "digest": "sha256:522eab2402e5001810155ef7eb56940b7c01a4fef62ac588886981c3b8ee8e1e" 2025-12-04T08:54:21.1034221Z }, 2025-12-04T08:54:21.1034304Z { 2025-12-04T08:54:21.1034433Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1034593Z "size": 1755, 2025-12-04T08:54:21.1034757Z "digest": "sha256:2b5a11b41761d8ea3b829e4772e4064cb6c4e4989126af324d0057661e4493a1" 
2025-12-04T08:54:21.1034935Z }, 2025-12-04T08:54:21.1035018Z { 2025-12-04T08:54:21.1035150Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1035309Z "size": 724, 2025-12-04T08:54:21.1035467Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T08:54:21.1035640Z }, 2025-12-04T08:54:21.1035723Z { 2025-12-04T08:54:21.1035853Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1036073Z "size": 3185588166, 2025-12-04T08:54:21.1036240Z "digest": "sha256:73e33534e9eb94cf29418d65944168962b65fe21f55e9b8bad18c76e9b3a37b8" 2025-12-04T08:54:21.1036414Z }, 2025-12-04T08:54:21.1036497Z { 2025-12-04T08:54:21.1036626Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1036780Z "size": 396, 2025-12-04T08:54:21.1036947Z "digest": "sha256:5bfdaeb5578d6ffcd7db29c48303cbceb13c591210feaa216a8daa7a6d445b4b" 2025-12-04T08:54:21.1037128Z }, 2025-12-04T08:54:21.1037212Z { 2025-12-04T08:54:21.1037372Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1037534Z "size": 236863, 2025-12-04T08:54:21.1037701Z "digest": "sha256:c07d27e4d3a5ba4ad5325bb785b2e4f058fe5e10ec1aeeb413a1e152b073f203" 2025-12-04T08:54:21.1037933Z }, 2025-12-04T08:54:21.1038015Z { 2025-12-04T08:54:21.1038146Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1038304Z "size": 787, 2025-12-04T08:54:21.1038465Z "digest": "sha256:b21856d1bf420da6fa8ec7331b82ab355d4f4178644e7d3a3d3d0fbc3610109a" 2025-12-04T08:54:21.1038643Z }, 2025-12-04T08:54:21.1038726Z { 2025-12-04T08:54:21.1038856Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1039013Z "size": 106, 2025-12-04T08:54:21.1039166Z "digest": "sha256:cb19d84867e4063f55db9459c28c50a2abc37c06d3c1ca82ba95fa8427cc438a" 2025-12-04T08:54:21.1039341Z }, 2025-12-04T08:54:21.1039423Z { 2025-12-04T08:54:21.1039554Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1039716Z "size": 1496, 2025-12-04T08:54:21.1039876Z "digest": "sha256:8165374f8dccf88a7791a5d31afbe29e4d4542b4f1cf1904945e07f9af6bf8ba" 2025-12-04T08:54:21.1040054Z }, 2025-12-04T08:54:21.1040140Z { 2025-12-04T08:54:21.1040269Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1040428Z "size": 458789560, 2025-12-04T08:54:21.1040599Z "digest": "sha256:1aecc77354ceba59ec6f0d37a558f2dbb6d5c0854553ee8505ac8707b422da6d" 2025-12-04T08:54:21.1040777Z }, 2025-12-04T08:54:21.1040856Z { 2025-12-04T08:54:21.1040984Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1041142Z "size": 164, 2025-12-04T08:54:21.1041303Z "digest": "sha256:465d3fd643aa2ea0ad07335cda66f12f1d7e5e800c4e9385ec466bc8a1ceabda" 2025-12-04T08:54:21.1041481Z }, 2025-12-04T08:54:21.1041563Z { 2025-12-04T08:54:21.1041692Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1041849Z "size": 104, 2025-12-04T08:54:21.1042006Z "digest": "sha256:6c503e779d6f41ca7f51309875df2b725c171926aece7009c4b8a64d1ba3f58e" 2025-12-04T08:54:21.1042225Z }, 2025-12-04T08:54:21.1042303Z { 2025-12-04T08:54:21.1042433Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1042590Z "size": 724, 2025-12-04T08:54:21.1042744Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T08:54:21.1042914Z }, 2025-12-04T08:54:21.1042995Z { 2025-12-04T08:54:21.1043127Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1043286Z "size": 196, 2025-12-04T08:54:21.1043444Z "digest": "sha256:f7e9a021f0ee3d11a50dcb96378af8103a21f6c3c142f54529207648f3ed00b2" 2025-12-04T08:54:21.1043621Z }, 2025-12-04T08:54:21.1043701Z { 2025-12-04T08:54:21.1043832Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1043991Z "size": 2583, 
2025-12-04T08:54:21.1044150Z "digest": "sha256:8e023b349080fb11ee55491bc9b842b30e9e3a90246d05b303a73dc62038caf2" 2025-12-04T08:54:21.1044327Z }, 2025-12-04T08:54:21.1044409Z { 2025-12-04T08:54:21.1044540Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1044701Z "size": 7577171420, 2025-12-04T08:54:21.1044869Z "digest": "sha256:8188df80e595a3dbcf84623c6a58a655269898cbb60029435f136d7f9d34ccaa" 2025-12-04T08:54:21.1045046Z }, 2025-12-04T08:54:21.1045129Z { 2025-12-04T08:54:21.1045264Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1045418Z "size": 135, 2025-12-04T08:54:21.1045582Z "digest": "sha256:3c2c2f8c74bfa16c4bf9a832c97bbb1d55205b2b4a2cead02cf74301ca1001fb" 2025-12-04T08:54:21.1045763Z }, 2025-12-04T08:54:21.1045847Z { 2025-12-04T08:54:21.1046047Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1046202Z "size": 104, 2025-12-04T08:54:21.1046363Z "digest": "sha256:2aa7784fbe3300f8bbfb6bb51cff3b01fd091e829c2bc7ab9e25261a0dd9b3bd" 2025-12-04T08:54:21.1046542Z }, 2025-12-04T08:54:21.1046627Z { 2025-12-04T08:54:21.1046756Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1046948Z "size": 612, 2025-12-04T08:54:21.1047110Z "digest": "sha256:2b3b5215d3ebe8789f0444457bfd5a6e218289b64aa07653ac3d03ddda5e6708" 2025-12-04T08:54:21.1047286Z }, 2025-12-04T08:54:21.1047369Z { 2025-12-04T08:54:21.1047498Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1047659Z "size": 838191945, 2025-12-04T08:54:21.1047828Z "digest": "sha256:99b1f1ea3e857834cebd01763d90fbd700aeb9c2d2ef23eda2cfff5652c9708b" 2025-12-04T08:54:21.1048006Z }, 2025-12-04T08:54:21.1048088Z { 2025-12-04T08:54:21.1048215Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1048366Z "size": 111, 2025-12-04T08:54:21.1048523Z "digest": 
"sha256:18d6daba0a5768a37ad106b57974f6b7efd35c43a87c246bcd3f43fea88f2d2b" 2025-12-04T08:54:21.1048696Z }, 2025-12-04T08:54:21.1048772Z { 2025-12-04T08:54:21.1048895Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1049049Z "size": 1555, 2025-12-04T08:54:21.1049206Z "digest": "sha256:5277f2a503ebd17ba9d9b86cc9bac86265504adeb449c0647616ddaacd3cbc41" 2025-12-04T08:54:21.1049377Z }, 2025-12-04T08:54:21.1049455Z { 2025-12-04T08:54:21.1049578Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1049730Z "size": 107, 2025-12-04T08:54:21.1049886Z "digest": "sha256:3198a9717aace920fd5de085319adf75091af05fc4318ce4b16a8a5b0e8d449e" 2025-12-04T08:54:21.1050059Z }, 2025-12-04T08:54:21.1050135Z { 2025-12-04T08:54:21.1050260Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1050413Z "size": 166, 2025-12-04T08:54:21.1050565Z "digest": "sha256:99a4918e5808277879449e97ccd7190db6b9aa2d742b57a3b831ce0198522bdd" 2025-12-04T08:54:21.1050732Z }, 2025-12-04T08:54:21.1050808Z { 2025-12-04T08:54:21.1050931Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1051116Z "size": 3526081, 2025-12-04T08:54:21.1051276Z "digest": "sha256:15bb11dfc6acc3537d527d6771c8e711e5605e99f82ec41e805d4600b8a97516" 2025-12-04T08:54:21.1051446Z }, 2025-12-04T08:54:21.1051522Z { 2025-12-04T08:54:21.1051643Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1051795Z "size": 107, 2025-12-04T08:54:21.1051955Z "digest": "sha256:bd87c8766e90e33db17514558ac591cc3f4149afd7abeaef4dd5770bbfa14210" 2025-12-04T08:54:21.1052129Z }, 2025-12-04T08:54:21.1052209Z { 2025-12-04T08:54:21.1052336Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1052489Z "size": 829, 2025-12-04T08:54:21.1052643Z "digest": "sha256:1969e15d0c13874ea5883ed829235a19ef6dc21c8aa6172032b78a8ffa6ff262" 2025-12-04T08:54:21.1052811Z }, 
2025-12-04T08:54:21.1052888Z { 2025-12-04T08:54:21.1053013Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1053170Z "size": 26973054, 2025-12-04T08:54:21.1053335Z "digest": "sha256:24a03847d382b73c11969f8f73916a6bedf5ccea12f6f4290b3880f29ceda32a" 2025-12-04T08:54:21.1053506Z }, 2025-12-04T08:54:21.1053584Z { 2025-12-04T08:54:21.1053707Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1053860Z "size": 104, 2025-12-04T08:54:21.1054016Z "digest": "sha256:816e2e34e01839a35d624dbf4bd9ac9bea4c975104af47a0e6b6b6dee6c6f98d" 2025-12-04T08:54:21.1054188Z }, 2025-12-04T08:54:21.1054265Z { 2025-12-04T08:54:21.1054388Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1054541Z "size": 424, 2025-12-04T08:54:21.1054692Z "digest": "sha256:b168858b85373f8ddca549d79267a06de4fa945d04bf791c55c9ddc93957fa3c" 2025-12-04T08:54:21.1054862Z }, 2025-12-04T08:54:21.1054943Z { 2025-12-04T08:54:21.1055067Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1055221Z "size": 19309386, 2025-12-04T08:54:21.1055394Z "digest": "sha256:6b8d5ff02e267e38322afbb8a58ed63ce9d75b10e9e73255e6affcbc6b6539bf" 2025-12-04T08:54:21.1055625Z }, 2025-12-04T08:54:21.1055704Z { 2025-12-04T08:54:21.1055828Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1056025Z "size": 826, 2025-12-04T08:54:21.1056181Z "digest": "sha256:4e3b10a5dd6aed29f238d604925e2a4f873141c1087c8dd4fdde5c61e7560893" 2025-12-04T08:54:21.1056351Z }, 2025-12-04T08:54:21.1056428Z { 2025-12-04T08:54:21.1056552Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1056705Z "size": 724, 2025-12-04T08:54:21.1056856Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T08:54:21.1057022Z }, 2025-12-04T08:54:21.1057103Z { 2025-12-04T08:54:21.1057226Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1057377Z "size": 149, 2025-12-04T08:54:21.1057534Z "digest": "sha256:3092fab73b59190b9facfc49bf18f58612172bc2fd68dfa339a1118632616939" 2025-12-04T08:54:21.1057708Z }, 2025-12-04T08:54:21.1057787Z { 2025-12-04T08:54:21.1057910Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1058061Z "size": 136, 2025-12-04T08:54:21.1058218Z "digest": "sha256:20020dd28a15ba092fcbfe906ee39cdddfcc9d0b7eb42fdd6f4c08a984fa9c00" 2025-12-04T08:54:21.1058395Z }, 2025-12-04T08:54:21.1058472Z { 2025-12-04T08:54:21.1058595Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1058747Z "size": 140, 2025-12-04T08:54:21.1058904Z "digest": "sha256:ae5280ce969dcff08c091e9a5f7641f13561b2b0ee44d78b7c3f81d8fe8e6d32" 2025-12-04T08:54:21.1059075Z }, 2025-12-04T08:54:21.1059151Z { 2025-12-04T08:54:21.1059273Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1059424Z "size": 32, 2025-12-04T08:54:21.1059582Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:54:21.1059792Z }, 2025-12-04T08:54:21.1059870Z { 2025-12-04T08:54:21.1059996Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1060149Z "size": 222, 2025-12-04T08:54:21.1060306Z "digest": "sha256:fe17d9eb0fd26d3af4c724bf570d833978b131cedb7dc17a800aa388a246b3cd" 2025-12-04T08:54:21.1060479Z }, 2025-12-04T08:54:21.1060555Z { 2025-12-04T08:54:21.1060675Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1060827Z "size": 346, 2025-12-04T08:54:21.1060980Z "digest": "sha256:a51e0dab2d596e6563483f27c12660007160847d177ba4c31812a8f44ada5754" 2025-12-04T08:54:21.1061147Z }, 2025-12-04T08:54:21.1061224Z { 2025-12-04T08:54:21.1061349Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1061501Z "size": 88300, 
2025-12-04T08:54:21.1061672Z "digest": "sha256:6eb176cefd72d37ecbcdf074289a8f1de732d8816cc695ece7e4709d098094d6" 2025-12-04T08:54:21.1061923Z }, 2025-12-04T08:54:21.1062053Z { 2025-12-04T08:54:21.1062314Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1062504Z "size": 106, 2025-12-04T08:54:21.1062695Z "digest": "sha256:e7b8cf2e8d5a4c56db9726ce62c1176032408b3b1c25a000592361cb4245e2b5" 2025-12-04T08:54:21.1062926Z }, 2025-12-04T08:54:21.1063029Z { 2025-12-04T08:54:21.1063192Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1063411Z "size": 1671, 2025-12-04T08:54:21.1063615Z "digest": "sha256:ef3a5060abce88884bc8bd815aa41c46427f34eeb132fe0ddd85a3f86e6dc83d" 2025-12-04T08:54:21.1075743Z }, 2025-12-04T08:54:21.1075840Z { 2025-12-04T08:54:21.1076066Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1076230Z "size": 724, 2025-12-04T08:54:21.1076388Z "digest": "sha256:9681563a88ff9e62494a2740e537440d3df978d466c9478d6a941fae8b57b084" 2025-12-04T08:54:21.1076561Z }, 2025-12-04T08:54:21.1076645Z { 2025-12-04T08:54:21.1076780Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1077007Z "size": 138, 2025-12-04T08:54:21.1077174Z "digest": "sha256:a6f4ec14b42b8f0a83d20aa6a985ddb6a1bf64e0ed3d44afd3484b87d4ed5ad3" 2025-12-04T08:54:21.1077353Z }, 2025-12-04T08:54:21.1077436Z { 2025-12-04T08:54:21.1077567Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1077723Z "size": 119, 2025-12-04T08:54:21.1077888Z "digest": "sha256:7e5a0c956cfbd6f8074fbfd3b1d416e6635d632835ec00c8dd4c015a21da19b4" 2025-12-04T08:54:21.1078067Z }, 2025-12-04T08:54:21.1078142Z { 2025-12-04T08:54:21.1078272Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1078434Z "size": 6238423049, 2025-12-04T08:54:21.1078606Z "digest": "sha256:b4f78730cfe76ce091b78b2e2e3d52be03f1097b3e4c3de5bd79f8d13a853132" 
2025-12-04T08:54:21.1078784Z }, 2025-12-04T08:54:21.1078867Z { 2025-12-04T08:54:21.1078999Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1079159Z "size": 174, 2025-12-04T08:54:21.1079319Z "digest": "sha256:081028f24389b112683689fd362e8c0d6f358082710e72feab91cea6383feb4d" 2025-12-04T08:54:21.1079491Z }, 2025-12-04T08:54:21.1079573Z { 2025-12-04T08:54:21.1079702Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1079859Z "size": 1896, 2025-12-04T08:54:21.1080025Z "digest": "sha256:a534dcf4b9a9e5fabed742c8a8fc43c9cfe7346ea88ab3c177c3b14fd3afe00a" 2025-12-04T08:54:21.1080205Z }, 2025-12-04T08:54:21.1080287Z { 2025-12-04T08:54:21.1080417Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1080577Z "size": 197577597, 2025-12-04T08:54:21.1080743Z "digest": "sha256:2e77500302cc13224427e1d74e471bd79d5109ba6a5099a83df1d10b786f71ba" 2025-12-04T08:54:21.1080920Z }, 2025-12-04T08:54:21.1081003Z { 2025-12-04T08:54:21.1081133Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1081330Z "size": 304, 2025-12-04T08:54:21.1081496Z "digest": "sha256:bc08246bb4ba18c3ec5bc69e16b6b4e929c5bd0f3fae10eeb0b1a622a63d6fa2" 2025-12-04T08:54:21.1081678Z }, 2025-12-04T08:54:21.1081759Z { 2025-12-04T08:54:21.1081887Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1082041Z "size": 32, 2025-12-04T08:54:21.1082202Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T08:54:21.1082377Z }, 2025-12-04T08:54:21.1082457Z { 2025-12-04T08:54:21.1082584Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1082739Z "size": 106, 2025-12-04T08:54:21.1082899Z "digest": "sha256:ff0c473ca120ebdcaa2ba10b3274e82032edd5196019e76d4e7584553704ae81" 2025-12-04T08:54:21.1083075Z }, 2025-12-04T08:54:21.1083158Z { 2025-12-04T08:54:21.1083287Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T08:54:21.1083447Z "size": 54145662, 2025-12-04T08:54:21.1083619Z "digest": "sha256:6bbc14b250efb3cdaad12c91573c6bb9129ad3e3432f0ed1a7eaebc9958d162f" 2025-12-04T08:54:21.1083800Z } 2025-12-04T08:54:21.1083885Z ] 2025-12-04T08:54:21.1083971Z } 2025-12-04T08:54:21.1103266Z ##[group]Run set -eux 2025-12-04T08:54:21.1103393Z set -eux 2025-12-04T08:54:21.1103558Z # It's ok if this steps fails, it would then be an anonymous user like what we used to have 2025-12-04T08:54:21.1103978Z aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token | jq --raw-output '.SecretString' | jq -r .docker_hub_readonly_token | docker login --username pytorchbot --password-stdin || true 2025-12-04T08:54:21.1108483Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:21.1108638Z env: 2025-12-04T08:54:21.1108739Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:21.1108880Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:21.1109062Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:21.1109237Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:21.1109668Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:21.1110041Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:21.1110166Z AWS_REGION: us-east-1 2025-12-04T08:54:21.1110394Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:21.1110552Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:21.1112579Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:21.1112694Z ##[endgroup] 2025-12-04T08:54:21.1153557Z + aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token 2025-12-04T08:54:21.1154192Z + jq --raw-output .SecretString 2025-12-04T08:54:21.1154401Z 
/home/runner/_work/_temp/4a4e631b-ae79-4b08-8df0-5679b82c7487.sh: line 3: aws: command not found 2025-12-04T08:54:21.1154637Z + jq -r .docker_hub_readonly_token 2025-12-04T08:54:21.1154827Z + docker login --username pytorchbot --password-stdin 2025-12-04T08:54:21.1237552Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:21.1254033Z + true 2025-12-04T08:54:21.1328320Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2025-12-04T08:54:21.1328506Z with: 2025-12-04T08:54:21.1328776Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:21.1329102Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:21.1329258Z env: 2025-12-04T08:54:21.1329354Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:21.1329496Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:21.1329673Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:21.1329842Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:21.1330389Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:21.1330775Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:21.1330893Z AWS_REGION: us-east-1 2025-12-04T08:54:21.1331122Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:21.1331275Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:21.1333316Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:21.1333423Z ##[endgroup] 2025-12-04T08:54:21.1344480Z ##[group]Run set -x 2025-12-04T08:54:21.1344605Z set -x 2025-12-04T08:54:21.1344699Z set +e 2025-12-04T08:54:21.1344790Z  2025-12-04T08:54:21.1344881Z login() { 2025-12-04T08:54:21.1345072Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 
2025-12-04T08:54:21.1345269Z } 2025-12-04T08:54:21.1345358Z  2025-12-04T08:54:21.1345456Z retry () { 2025-12-04T08:54:21.1345593Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T08:54:21.1345725Z } 2025-12-04T08:54:21.1345814Z  2025-12-04T08:54:21.1345914Z retry login "${DOCKER_REGISTRY}" 2025-12-04T08:54:21.1346103Z  2025-12-04T08:54:21.1346297Z IMAGE_SIZE=$(docker manifest inspect "${DOCKER_IMAGE}" | jq '[.layers[].size, .config.size] | add / 1024 / 1024') 2025-12-04T08:54:21.1346548Z echo "Compressed size of image in MB: ${IMAGE_SIZE}" 2025-12-04T08:54:21.1346696Z  2025-12-04T08:54:21.1346788Z set -e 2025-12-04T08:54:21.1346929Z # ignore output since only exit code is used for conditional 2025-12-04T08:54:21.1347121Z # only pull docker image if it's not available locally 2025-12-04T08:54:21.1347330Z if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2025-12-04T08:54:21.1347524Z  retry docker pull "${DOCKER_IMAGE}" 2025-12-04T08:54:21.1347659Z fi 2025-12-04T08:54:21.1351641Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T08:54:21.1351791Z env: 2025-12-04T08:54:21.1351886Z GIT_DEFAULT_BRANCH: main 2025-12-04T08:54:21.1352031Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T08:54:21.1352214Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T08:54:21.1352384Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T08:54:21.1352773Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T08:54:21.1353152Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T08:54:21.1353269Z AWS_REGION: us-east-1 2025-12-04T08:54:21.1353436Z AWS_ACCESS_KEY_ID: *** 2025-12-04T08:54:21.1353590Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T08:54:21.1355630Z AWS_SESSION_TOKEN: *** 2025-12-04T08:54:21.1355913Z DOCKER_IMAGE: 
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:21.1356456Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:21.1356616Z ##[endgroup] 2025-12-04T08:54:21.1375817Z + set +e 2025-12-04T08:54:21.1376014Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:21.1376187Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:21.1379633Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:21.1380020Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:54:21.1380300Z /home/runner/_work/_temp/1fabfea6-82fb-4041-8ad3-9f1db905c753.sh: line 5: aws: command not found 2025-12-04T08:54:21.1452152Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:21.1459699Z + sleep 1 2025-12-04T08:54:22.1471176Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:22.1475556Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:54:22.1476451Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:22.1477353Z /home/runner/_work/_temp/1fabfea6-82fb-4041-8ad3-9f1db905c753.sh: line 5: aws: command not found 2025-12-04T08:54:22.1559157Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:22.1571302Z + sleep 2 2025-12-04T08:54:24.1583945Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:24.1585252Z + aws ecr get-login-password --region us-east-1 2025-12-04T08:54:24.1585517Z /home/runner/_work/_temp/1fabfea6-82fb-4041-8ad3-9f1db905c753.sh: line 5: aws: command not found 2025-12-04T08:54:24.1585795Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T08:54:24.1737594Z Error: Cannot perform an interactive login from a non TTY device 2025-12-04T08:54:24.1788637Z ++ docker manifest 
inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:24.1789014Z ++ jq '[.layers[].size, .config.size] | add / 1024 / 1024' 2025-12-04T08:54:25.5718805Z + IMAGE_SIZE=18171.470620155334 2025-12-04T08:54:25.5719062Z + echo 'Compressed size of image in MB: 18171.470620155334' 2025-12-04T08:54:25.5719342Z + set -e 2025-12-04T08:54:25.5719473Z Compressed size of image in MB: 18171.470620155334 2025-12-04T08:54:25.5719885Z + docker inspect --type=image 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:25.5829573Z + retry docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:25.5830125Z + docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T08:54:26.6885464Z pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a: Pulling from pytorch/ci-image 2025-12-04T08:54:26.6886634Z 63e5bc7682b8: Pulling fs layer 2025-12-04T08:54:26.6887085Z 835841cca3b7: Pulling fs layer 2025-12-04T08:54:26.6887457Z aac69780afc8: Pulling fs layer 2025-12-04T08:54:26.6887878Z 029495b23122: Pulling fs layer 2025-12-04T08:54:26.6888315Z d0fb85b00833: Pulling fs layer 2025-12-04T08:54:26.6888675Z 59b639308833: Pulling fs layer 2025-12-04T08:54:26.6889028Z dc112c89d57a: Pulling fs layer 2025-12-04T08:54:26.6889376Z 522eab2402e5: Pulling fs layer 2025-12-04T08:54:26.6889729Z 2b5a11b41761: Pulling fs layer 2025-12-04T08:54:26.6890080Z 9681563a88ff: Pulling fs layer 2025-12-04T08:54:26.6890433Z 73e33534e9eb: Pulling fs layer 2025-12-04T08:54:26.6890789Z 5bfdaeb5578d: Pulling fs layer 2025-12-04T08:54:26.6891146Z c07d27e4d3a5: Pulling fs layer 2025-12-04T08:54:26.6891498Z 
b21856d1bf42: Pulling fs layer 2025-12-04T08:54:26.6891866Z cb19d84867e4: Pulling fs layer 2025-12-04T08:54:26.6892215Z 8165374f8dcc: Pulling fs layer 2025-12-04T08:54:26.6892566Z 1aecc77354ce: Pulling fs layer 2025-12-04T08:54:26.6892918Z 465d3fd643aa: Pulling fs layer 2025-12-04T08:54:26.6894051Z 6c503e779d6f: Pulling fs layer 2025-12-04T08:54:26.6894413Z f7e9a021f0ee: Pulling fs layer 2025-12-04T08:54:26.6894758Z 029495b23122: Waiting 2025-12-04T08:54:26.6895094Z 8e023b349080: Pulling fs layer 2025-12-04T08:54:26.6895440Z d0fb85b00833: Waiting 2025-12-04T08:54:26.6895743Z 59b639308833: Waiting 2025-12-04T08:54:26.6896205Z 8188df80e595: Pulling fs layer 2025-12-04T08:54:26.6896560Z 3c2c2f8c74bf: Pulling fs layer 2025-12-04T08:54:26.6896903Z 2b5a11b41761: Waiting 2025-12-04T08:54:26.6897222Z 2aa7784fbe33: Pulling fs layer 2025-12-04T08:54:26.6897581Z 2b3b5215d3eb: Pulling fs layer 2025-12-04T08:54:26.6897920Z dc112c89d57a: Waiting 2025-12-04T08:54:26.6898225Z 522eab2402e5: Waiting 2025-12-04T08:54:26.6898546Z 99b1f1ea3e85: Pulling fs layer 2025-12-04T08:54:26.6899142Z 18d6daba0a57: Pulling fs layer 2025-12-04T08:54:26.6899484Z 5bfdaeb5578d: Waiting 2025-12-04T08:54:26.6899806Z 5277f2a503eb: Pulling fs layer 2025-12-04T08:54:26.6900151Z 8165374f8dcc: Waiting 2025-12-04T08:54:26.6900477Z 3198a9717aac: Pulling fs layer 2025-12-04T08:54:26.6900817Z 73e33534e9eb: Waiting 2025-12-04T08:54:26.6901123Z 1aecc77354ce: Waiting 2025-12-04T08:54:26.6901429Z 465d3fd643aa: Waiting 2025-12-04T08:54:26.6901730Z c07d27e4d3a5: Waiting 2025-12-04T08:54:26.6902045Z 99a4918e5808: Pulling fs layer 2025-12-04T08:54:26.6902387Z 6c503e779d6f: Waiting 2025-12-04T08:54:26.6902704Z 15bb11dfc6ac: Pulling fs layer 2025-12-04T08:54:26.6903043Z 9681563a88ff: Waiting 2025-12-04T08:54:26.6903358Z bd87c8766e90: Pulling fs layer 2025-12-04T08:54:26.6903698Z 3c2c2f8c74bf: Waiting 2025-12-04T08:54:26.6904002Z f7e9a021f0ee: Waiting 2025-12-04T08:54:26.6904306Z 99b1f1ea3e85: Waiting 
2025-12-04T08:54:26.6917266Z 1969e15d0c13: Pulling fs layer 2025-12-04T08:54:26.6918942Z 2aa7784fbe33: Waiting 2025-12-04T08:54:26.6919603Z 18d6daba0a57: Waiting 2025-12-04T08:54:26.6919942Z 8e023b349080: Waiting 2025-12-04T08:54:26.6920254Z 2b3b5215d3eb: Waiting 2025-12-04T08:54:26.6920566Z 5277f2a503eb: Waiting 2025-12-04T08:54:26.6920890Z 3198a9717aac: Waiting 2025-12-04T08:54:26.6921228Z 24a03847d382: Pulling fs layer 2025-12-04T08:54:26.6921599Z 816e2e34e018: Pulling fs layer 2025-12-04T08:54:26.6921947Z b21856d1bf42: Waiting 2025-12-04T08:54:26.6922251Z 99a4918e5808: Waiting 2025-12-04T08:54:26.6922546Z 8188df80e595: Waiting 2025-12-04T08:54:26.6922846Z 1969e15d0c13: Waiting 2025-12-04T08:54:26.6923217Z b168858b8537: Pulling fs layer 2025-12-04T08:54:26.6923548Z 24a03847d382: Waiting 2025-12-04T08:54:26.6923846Z cb19d84867e4: Waiting 2025-12-04T08:54:26.6924168Z 6b8d5ff02e26: Pulling fs layer 2025-12-04T08:54:26.6924511Z 816e2e34e018: Waiting 2025-12-04T08:54:26.6924810Z bd87c8766e90: Waiting 2025-12-04T08:54:26.6925124Z 4e3b10a5dd6a: Pulling fs layer 2025-12-04T08:54:26.6925477Z 3092fab73b59: Pulling fs layer 2025-12-04T08:54:26.6925830Z 20020dd28a15: Pulling fs layer 2025-12-04T08:54:26.6926280Z 15bb11dfc6ac: Waiting 2025-12-04T08:54:26.6926587Z 3092fab73b59: Waiting 2025-12-04T08:54:26.6926889Z 6b8d5ff02e26: Waiting 2025-12-04T08:54:26.6927192Z b168858b8537: Waiting 2025-12-04T08:54:26.6927507Z ae5280ce969d: Pulling fs layer 2025-12-04T08:54:26.6927846Z 20020dd28a15: Waiting 2025-12-04T08:54:26.6928143Z ae5280ce969d: Waiting 2025-12-04T08:54:26.6928444Z 4e3b10a5dd6a: Waiting 2025-12-04T08:54:26.6928757Z 4f4fb700ef54: Pulling fs layer 2025-12-04T08:54:26.6929118Z fe17d9eb0fd2: Pulling fs layer 2025-12-04T08:54:26.6929472Z a51e0dab2d59: Pulling fs layer 2025-12-04T08:54:26.6929828Z 6eb176cefd72: Pulling fs layer 2025-12-04T08:54:26.6930184Z e7b8cf2e8d5a: Pulling fs layer 2025-12-04T08:54:26.6930529Z ef3a5060abce: Pulling fs layer 2025-12-04T08:54:26.6930867Z 
fe17d9eb0fd2: Waiting 2025-12-04T08:54:26.6931172Z a51e0dab2d59: Waiting 2025-12-04T08:54:26.6931489Z a6f4ec14b42b: Pulling fs layer 2025-12-04T08:54:26.6931840Z 7e5a0c956cfb: Pulling fs layer 2025-12-04T08:54:26.6932186Z 4f4fb700ef54: Waiting 2025-12-04T08:54:26.6932501Z b4f78730cfe7: Pulling fs layer 2025-12-04T08:54:26.6932842Z 6eb176cefd72: Waiting 2025-12-04T08:54:26.6934099Z 081028f24389: Pulling fs layer 2025-12-04T08:54:26.6934443Z ef3a5060abce: Waiting 2025-12-04T08:54:26.6934747Z e7b8cf2e8d5a: Waiting 2025-12-04T08:54:26.6935052Z a6f4ec14b42b: Waiting 2025-12-04T08:54:26.6935347Z 7e5a0c956cfb: Waiting 2025-12-04T08:54:26.6935669Z a534dcf4b9a9: Pulling fs layer 2025-12-04T08:54:26.6936128Z 2e77500302cc: Pulling fs layer 2025-12-04T08:54:26.6936461Z b4f78730cfe7: Waiting 2025-12-04T08:54:26.6936774Z bc08246bb4ba: Pulling fs layer 2025-12-04T08:54:26.6937114Z 2e77500302cc: Waiting 2025-12-04T08:54:26.6937427Z ff0c473ca120: Pulling fs layer 2025-12-04T08:54:26.6937771Z a534dcf4b9a9: Waiting 2025-12-04T08:54:26.6938088Z 6bbc14b250ef: Pulling fs layer 2025-12-04T08:54:26.6938430Z bc08246bb4ba: Waiting 2025-12-04T08:54:26.6938723Z ff0c473ca120: Waiting 2025-12-04T08:54:26.6939244Z 081028f24389: Waiting 2025-12-04T08:54:26.6939533Z 6bbc14b250ef: Waiting 2025-12-04T08:54:27.2843089Z 835841cca3b7: Verifying Checksum 2025-12-04T08:54:27.2843486Z 835841cca3b7: Download complete 2025-12-04T08:54:27.8701016Z 029495b23122: Verifying Checksum 2025-12-04T08:54:27.8701950Z 029495b23122: Download complete 2025-12-04T08:54:28.3558325Z 63e5bc7682b8: Verifying Checksum 2025-12-04T08:54:28.3559420Z 63e5bc7682b8: Download complete 2025-12-04T08:54:28.4564061Z d0fb85b00833: Verifying Checksum 2025-12-04T08:54:28.4564851Z d0fb85b00833: Download complete 2025-12-04T08:54:29.0244323Z 63e5bc7682b8: Pull complete 2025-12-04T08:54:29.0253993Z 59b639308833: Verifying Checksum 2025-12-04T08:54:29.0254264Z 59b639308833: Download complete 2025-12-04T08:54:29.0412609Z 835841cca3b7: Pull 
complete 2025-12-04T08:54:29.6085873Z 522eab2402e5: Verifying Checksum 2025-12-04T08:54:29.6086397Z 522eab2402e5: Download complete 2025-12-04T08:54:30.1954853Z 2b5a11b41761: Verifying Checksum 2025-12-04T08:54:30.1955829Z 2b5a11b41761: Download complete 2025-12-04T08:54:30.7910823Z 9681563a88ff: Download complete 2025-12-04T08:54:31.9377266Z dc112c89d57a: Verifying Checksum 2025-12-04T08:54:31.9378221Z dc112c89d57a: Download complete 2025-12-04T08:54:32.5597704Z 5bfdaeb5578d: Verifying Checksum 2025-12-04T08:54:32.5597944Z 5bfdaeb5578d: Download complete 2025-12-04T08:54:33.5074310Z c07d27e4d3a5: Verifying Checksum 2025-12-04T08:54:33.5075414Z c07d27e4d3a5: Download complete 2025-12-04T08:54:34.1361983Z b21856d1bf42: Verifying Checksum 2025-12-04T08:54:34.1362211Z b21856d1bf42: Download complete 2025-12-04T08:54:34.7216422Z cb19d84867e4: Verifying Checksum 2025-12-04T08:54:34.7216999Z cb19d84867e4: Download complete 2025-12-04T08:54:35.3243906Z 8165374f8dcc: Verifying Checksum 2025-12-04T08:54:35.3276928Z 8165374f8dcc: Download complete 2025-12-04T08:54:35.3646188Z aac69780afc8: Verifying Checksum 2025-12-04T08:54:35.3647023Z aac69780afc8: Download complete 2025-12-04T08:54:35.9665745Z 465d3fd643aa: Verifying Checksum 2025-12-04T08:54:35.9665909Z 465d3fd643aa: Download complete 2025-12-04T08:54:36.5650036Z 6c503e779d6f: Download complete 2025-12-04T08:54:37.2248476Z f7e9a021f0ee: Verifying Checksum 2025-12-04T08:54:37.2248759Z f7e9a021f0ee: Download complete 2025-12-04T08:54:37.8401499Z 8e023b349080: Download complete 2025-12-04T08:54:40.7877121Z aac69780afc8: Pull complete 2025-12-04T08:54:40.8030204Z 029495b23122: Pull complete 2025-12-04T08:54:40.8119225Z d0fb85b00833: Pull complete 2025-12-04T08:54:40.8240630Z 59b639308833: Pull complete 2025-12-04T08:54:42.3420244Z dc112c89d57a: Pull complete 2025-12-04T08:54:42.3522036Z 522eab2402e5: Pull complete 2025-12-04T08:54:42.3563054Z 2b5a11b41761: Pull complete 2025-12-04T08:54:42.3607353Z 9681563a88ff: Pull 
complete 2025-12-04T08:54:47.5077442Z 1aecc77354ce: Verifying Checksum 2025-12-04T08:54:47.5077656Z 1aecc77354ce: Download complete 2025-12-04T08:54:48.1327988Z 3c2c2f8c74bf: Verifying Checksum 2025-12-04T08:54:48.1328191Z 3c2c2f8c74bf: Download complete 2025-12-04T08:54:48.8330962Z 2aa7784fbe33: Download complete 2025-12-04T08:54:49.4578523Z 2b3b5215d3eb: Verifying Checksum 2025-12-04T08:54:49.4578949Z 2b3b5215d3eb: Download complete 2025-12-04T08:56:43.7879230Z 99b1f1ea3e85: Verifying Checksum 2025-12-04T08:56:43.7880231Z 99b1f1ea3e85: Download complete 2025-12-04T08:56:44.3915086Z 18d6daba0a57: Download complete 2025-12-04T08:56:45.0104865Z 5277f2a503eb: Verifying Checksum 2025-12-04T08:56:45.0105318Z 5277f2a503eb: Download complete 2025-12-04T08:56:45.6792574Z 3198a9717aac: Verifying Checksum 2025-12-04T08:56:45.6793123Z 3198a9717aac: Download complete 2025-12-04T08:56:46.3084260Z 99a4918e5808: Verifying Checksum 2025-12-04T08:56:46.3084707Z 99a4918e5808: Download complete 2025-12-04T08:56:47.4912894Z 15bb11dfc6ac: Verifying Checksum 2025-12-04T08:56:47.4913441Z 15bb11dfc6ac: Download complete 2025-12-04T08:56:48.1013590Z bd87c8766e90: Verifying Checksum 2025-12-04T08:56:48.1014069Z bd87c8766e90: Download complete 2025-12-04T08:56:48.7266013Z 1969e15d0c13: Verifying Checksum 2025-12-04T08:56:48.7267120Z 1969e15d0c13: Download complete 2025-12-04T08:56:50.6298294Z 24a03847d382: Verifying Checksum 2025-12-04T08:56:50.6298840Z 24a03847d382: Download complete 2025-12-04T08:56:51.2747083Z 816e2e34e018: Verifying Checksum 2025-12-04T08:56:51.2747558Z 816e2e34e018: Download complete 2025-12-04T08:56:51.8914655Z b168858b8537: Download complete 2025-12-04T08:56:53.8404519Z 6b8d5ff02e26: Verifying Checksum 2025-12-04T08:56:53.8405342Z 6b8d5ff02e26: Download complete 2025-12-04T08:56:54.4453949Z 4e3b10a5dd6a: Download complete 2025-12-04T08:56:55.0380728Z 3092fab73b59: Verifying Checksum 2025-12-04T08:56:55.0381221Z 3092fab73b59: Download complete 
2025-12-04T08:56:55.6249819Z 20020dd28a15: Verifying Checksum 2025-12-04T08:56:55.6250343Z 20020dd28a15: Download complete 2025-12-04T08:56:56.2284110Z ae5280ce969d: Download complete 2025-12-04T08:56:56.5366225Z 4f4fb700ef54: Verifying Checksum 2025-12-04T08:56:56.5366713Z 4f4fb700ef54: Download complete 2025-12-04T08:56:57.1232744Z fe17d9eb0fd2: Download complete 2025-12-04T08:56:57.7349761Z a51e0dab2d59: Verifying Checksum 2025-12-04T08:56:57.7349926Z a51e0dab2d59: Download complete 2025-12-04T08:56:58.5202202Z 6eb176cefd72: Verifying Checksum 2025-12-04T08:56:58.5202831Z 6eb176cefd72: Download complete 2025-12-04T08:56:59.1159647Z e7b8cf2e8d5a: Download complete 2025-12-04T08:56:59.7361356Z ef3a5060abce: Verifying Checksum 2025-12-04T08:56:59.7362015Z ef3a5060abce: Download complete 2025-12-04T08:57:00.3098238Z a6f4ec14b42b: Verifying Checksum 2025-12-04T08:57:00.9047408Z 7e5a0c956cfb: Verifying Checksum 2025-12-04T08:57:00.9047971Z 7e5a0c956cfb: Download complete 2025-12-04T09:15:42.4960046Z 73e33534e9eb: Verifying Checksum 2025-12-04T09:15:42.4960271Z 73e33534e9eb: Download complete 2025-12-04T09:15:43.0789451Z 081028f24389: Verifying Checksum 2025-12-04T09:15:43.0794496Z 081028f24389: Download complete 2025-12-04T09:15:43.6644606Z a534dcf4b9a9: Verifying Checksum 2025-12-04T09:15:43.6644824Z a534dcf4b9a9: Download complete 2025-12-04T09:15:48.8280561Z 2e77500302cc: Verifying Checksum 2025-12-04T09:15:48.8280903Z 2e77500302cc: Download complete 2025-12-04T09:15:49.4182127Z bc08246bb4ba: Verifying Checksum 2025-12-04T09:15:49.4182923Z bc08246bb4ba: Download complete 2025-12-04T09:15:50.0098047Z ff0c473ca120: Verifying Checksum 2025-12-04T09:15:50.0098823Z ff0c473ca120: Download complete 2025-12-04T09:15:52.2751023Z 6bbc14b250ef: Verifying Checksum 2025-12-04T09:15:52.2751236Z 6bbc14b250ef: Download complete 2025-12-04T09:16:09.6975806Z 73e33534e9eb: Pull complete 2025-12-04T09:16:09.7014892Z 5bfdaeb5578d: Pull complete 2025-12-04T09:16:09.7143173Z 
c07d27e4d3a5: Pull complete 2025-12-04T09:16:09.7220869Z b21856d1bf42: Pull complete 2025-12-04T09:16:09.7338590Z cb19d84867e4: Pull complete 2025-12-04T09:16:09.7384891Z 8165374f8dcc: Pull complete 2025-12-04T09:16:14.2852659Z 1aecc77354ce: Pull complete 2025-12-04T09:16:14.2999859Z 465d3fd643aa: Pull complete 2025-12-04T09:16:14.3051917Z 6c503e779d6f: Pull complete 2025-12-04T09:16:14.3197522Z f7e9a021f0ee: Pull complete 2025-12-04T09:16:14.3315659Z 8e023b349080: Pull complete 2025-12-04T09:16:36.5040733Z b4f78730cfe7: Verifying Checksum 2025-12-04T09:16:36.5041274Z b4f78730cfe7: Download complete 2025-12-04T09:21:25.9849229Z 8188df80e595: Verifying Checksum 2025-12-04T09:21:25.9852358Z 8188df80e595: Download complete 2025-12-04T09:22:59.9669279Z 8188df80e595: Pull complete 2025-12-04T09:22:59.9733841Z 3c2c2f8c74bf: Pull complete 2025-12-04T09:22:59.9846848Z 2aa7784fbe33: Pull complete 2025-12-04T09:22:59.9909590Z 2b3b5215d3eb: Pull complete 2025-12-04T09:23:10.7821772Z 99b1f1ea3e85: Pull complete 2025-12-04T09:23:10.7945042Z 18d6daba0a57: Pull complete 2025-12-04T09:23:10.8020597Z 5277f2a503eb: Pull complete 2025-12-04T09:23:10.8091254Z 3198a9717aac: Pull complete 2025-12-04T09:23:10.8149368Z 99a4918e5808: Pull complete 2025-12-04T09:23:10.8614009Z 15bb11dfc6ac: Pull complete 2025-12-04T09:23:10.8716044Z bd87c8766e90: Pull complete 2025-12-04T09:23:10.8776010Z 1969e15d0c13: Pull complete 2025-12-04T09:23:11.2435775Z 24a03847d382: Pull complete 2025-12-04T09:23:11.2492141Z 816e2e34e018: Pull complete 2025-12-04T09:23:11.2562365Z b168858b8537: Pull complete 2025-12-04T09:23:11.4314894Z 6b8d5ff02e26: Pull complete 2025-12-04T09:23:11.4372766Z 4e3b10a5dd6a: Pull complete 2025-12-04T09:23:11.4469948Z 3092fab73b59: Pull complete 2025-12-04T09:23:11.4572593Z 20020dd28a15: Pull complete 2025-12-04T09:23:11.4637207Z ae5280ce969d: Pull complete 2025-12-04T09:23:11.4682822Z 4f4fb700ef54: Pull complete 2025-12-04T09:23:11.4740566Z fe17d9eb0fd2: Pull complete 
2025-12-04T09:23:11.4838686Z a51e0dab2d59: Pull complete 2025-12-04T09:23:11.4933913Z 6eb176cefd72: Pull complete 2025-12-04T09:23:11.4973603Z e7b8cf2e8d5a: Pull complete 2025-12-04T09:23:11.5025734Z ef3a5060abce: Pull complete 2025-12-04T09:23:11.5128991Z a6f4ec14b42b: Pull complete 2025-12-04T09:23:11.5198864Z 7e5a0c956cfb: Pull complete 2025-12-04T09:24:20.5548875Z b4f78730cfe7: Pull complete 2025-12-04T09:24:20.5627522Z 081028f24389: Pull complete 2025-12-04T09:24:20.5705293Z a534dcf4b9a9: Pull complete 2025-12-04T09:24:25.7394293Z 2e77500302cc: Pull complete 2025-12-04T09:24:25.7463874Z bc08246bb4ba: Pull complete 2025-12-04T09:24:25.7576510Z ff0c473ca120: Pull complete 2025-12-04T09:24:27.1221843Z 6bbc14b250ef: Pull complete 2025-12-04T09:24:27.1257455Z Digest: sha256:5e190224966743059cf8506170eaec525eada34e38cf646e02d1dbeadfe5a366 2025-12-04T09:24:27.1265309Z Status: Downloaded newer image for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:24:27.1279260Z 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:24:27.1366453Z Prepare all required actions 2025-12-04T09:24:27.1409547Z ##[group]Run ./.github/actions/get-workflow-job-id 2025-12-04T09:24:27.1409991Z with: 2025-12-04T09:24:27.1410669Z github-token: *** 2025-12-04T09:24:27.1410972Z env: 2025-12-04T09:24:27.1411287Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:27.1411732Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:24:27.1412313Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:24:27.1412855Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:24:27.1414136Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt 
seccomp=unconfined --network=host 2025-12-04T09:24:27.1415374Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:24:27.1415745Z AWS_REGION: us-east-1 2025-12-04T09:24:27.1416206Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:24:27.1416716Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:24:27.1423447Z AWS_SESSION_TOKEN: *** 2025-12-04T09:24:27.1423784Z ##[endgroup] 2025-12-04T09:24:27.1442551Z ##[group]Run set -eux 2025-12-04T09:24:27.1442906Z set -eux 2025-12-04T09:24:27.1443451Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T09:24:27.1452993Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:24:27.1453459Z env: 2025-12-04T09:24:27.1453752Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:27.1454187Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:24:27.1454756Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:24:27.1455293Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:24:27.1456606Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:24:27.1457802Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:24:27.1458177Z AWS_REGION: us-east-1 2025-12-04T09:24:27.1458661Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:24:27.1482877Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:24:27.1489710Z AWS_SESSION_TOKEN: *** 2025-12-04T09:24:27.1490542Z GITHUB_TOKEN: *** 2025-12-04T09:24:27.1490860Z ##[endgroup] 2025-12-04T09:24:27.1532862Z + python3 .github/scripts/get_workflow_job_id.py 19922849170 linux.rocm.gpu.gfx942.1.b-gwk9b-runner-vcbrh 2025-12-04T09:24:28.1805660Z Setting output job-id=57116213184 2025-12-04T09:24:28.1806722Z Setting output job-name=linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:24:28.1972673Z Prepare all 
required actions 2025-12-04T09:24:28.1972885Z Getting action download info 2025-12-04T09:24:28.5923933Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T09:24:29.6963911Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-12-04T09:24:30.7791907Z ##[group]Run ./.github/actions/download-build-artifacts 2025-12-04T09:24:30.7792062Z with: 2025-12-04T09:24:30.7792166Z name: linux-jammy-rocm-py3.10 2025-12-04T09:24:30.7792303Z s3-bucket: gha-artifacts 2025-12-04T09:24:30.7792408Z env: 2025-12-04T09:24:30.7792499Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:30.7792634Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:24:30.7792812Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:24:30.7792976Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:24:30.7793368Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:24:30.7793741Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:24:30.7793855Z AWS_REGION: us-east-1 2025-12-04T09:24:30.7794007Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:24:30.7794155Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:24:30.7796231Z AWS_SESSION_TOKEN: *** 2025-12-04T09:24:30.7796337Z ##[endgroup] 2025-12-04T09:24:30.7809572Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:24:30.7809709Z with: 2025-12-04T09:24:30.7809808Z name: linux-jammy-rocm-py3.10 2025-12-04T09:24:30.7809926Z s3-bucket: gha-artifacts 2025-12-04T09:24:30.7810033Z region: us-east-1 2025-12-04T09:24:30.7810125Z env: 2025-12-04T09:24:30.7810213Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:24:30.7810347Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:24:30.7810524Z RUNNER_TEST_RESULTS_DIR: 
/home/runner/_work/_temp/test-results 2025-12-04T09:24:30.7810689Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:24:30.7811069Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:24:30.7811439Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:24:30.7811550Z AWS_REGION: us-east-1 2025-12-04T09:24:30.7811695Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:24:30.7811845Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:24:30.7813839Z AWS_SESSION_TOKEN: *** 2025-12-04T09:24:30.7813941Z ##[endgroup] 2025-12-04T09:24:31.0179331Z (node:17221) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:24:31.0179946Z 2025-12-04T09:24:31.0180218Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T09:24:31.0180928Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:24:31.0181628Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:24:31.3051786Z Found 1 objects with prefix pytorch/pytorch/19922849170/linux-jammy-rocm-py3.10/ 2025-12-04T09:24:31.3052623Z Starting download (1/1): /home/runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:29:01.2561756Z Finished download (1/1): /home/runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:29:01.2572210Z Artifact download has finished successfully 2025-12-04T09:29:01.2814051Z ##[group]Run unzip -o artifacts.zip 2025-12-04T09:29:01.2814219Z unzip -o artifacts.zip 2025-12-04T09:29:01.2818303Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:01.2818457Z env: 2025-12-04T09:29:01.2818562Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:01.2818710Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:01.2819080Z RUNNER_TEST_RESULTS_DIR: 
/home/runner/_work/_temp/test-results 2025-12-04T09:29:01.2819256Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:01.2819647Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:01.2820029Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:01.2820151Z AWS_REGION: us-east-1 2025-12-04T09:29:01.2820348Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:01.2820502Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:01.2822525Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:01.2822638Z ##[endgroup] 2025-12-04T09:29:01.2870797Z Archive: artifacts.zip 2025-12-04T09:29:01.2870985Z creating: dist/ 2025-12-04T09:29:01.2937953Z inflating: dist/.ninja_log 2025-12-04T09:29:04.2318412Z inflating: dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:29:04.2319113Z creating: build/ 2025-12-04T09:29:04.2319424Z creating: build/custom_test_artifacts/ 2025-12-04T09:29:04.2337811Z creating: build/custom_test_artifacts/custom-op-build/ 2025-12-04T09:29:04.2338084Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2025-12-04T09:29:04.2338321Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:29:04.2338623Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:29:04.2338885Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/ 2025-12-04T09:29:04.2339189Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:29:04.2339456Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:29:04.2339710Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:29:04.2340019Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:29:04.2340312Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:29:04.2340593Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:29:04.2340862Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:29:04.2341124Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:29:04.2341457Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:29:04.2341761Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:29:04.2342040Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:29:04.2342343Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:29:04.2342665Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:29:04.2342941Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:29:04.2343164Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:29:04.2343398Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2025-12-04T09:29:04.2343638Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2025-12-04T09:29:04.2344274Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2025-12-04T09:29:04.2344574Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2025-12-04T09:29:04.2345049Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2025-12-04T09:29:04.2345325Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2025-12-04T09:29:04.2345607Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2025-12-04T09:29:04.2345896Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2025-12-04T09:29:04.2346240Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2025-12-04T09:29:04.2346527Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2025-12-04T09:29:04.2346818Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2025-12-04T09:29:04.2403013Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2025-12-04T09:29:04.2457074Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2025-12-04T09:29:04.2457385Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2025-12-04T09:29:04.2457712Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2025-12-04T09:29:04.2458023Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2025-12-04T09:29:04.2458321Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2025-12-04T09:29:04.2458610Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2025-12-04T09:29:04.2458910Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2025-12-04T09:29:04.2459206Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2025-12-04T09:29:04.2459562Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2025-12-04T09:29:04.2459871Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2025-12-04T09:29:04.2460165Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 2025-12-04T09:29:04.2465067Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2025-12-04T09:29:04.2508972Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2025-12-04T09:29:04.2509431Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:29:04.2509732Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:29:04.2510014Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2025-12-04T09:29:04.2510416Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2025-12-04T09:29:04.2511563Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2025-12-04T09:29:04.2512612Z inflating: build/custom_test_artifacts/custom-op-build/hipblaslt_test_outer_vec.cc 2025-12-04T09:29:04.2512941Z inflating: build/custom_test_artifacts/custom-op-build/hipblaslt_test_vec_ext.cc 2025-12-04T09:29:04.2513194Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2025-12-04T09:29:04.2514302Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2025-12-04T09:29:04.2514884Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2025-12-04T09:29:04.2606043Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2025-12-04T09:29:04.2636009Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2025-12-04T09:29:04.2636332Z creating: build/custom_test_artifacts/jit-hook-build/ 
2025-12-04T09:29:04.2636833Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2025-12-04T09:29:04.2637076Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:29:04.2639459Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:29:04.2639714Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/ 2025-12-04T09:29:04.2639969Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:29:04.2640239Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:29:04.2640522Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:29:04.2641780Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:29:04.2642633Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:29:04.2643067Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:29:04.2643349Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:29:04.2643624Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:29:04.2645161Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:29:04.2646272Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:29:04.2646641Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:29:04.2655672Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:29:04.2656522Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:29:04.2656832Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:29:04.2657060Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:29:04.2657297Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2025-12-04T09:29:04.2657545Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2025-12-04T09:29:04.2657823Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2025-12-04T09:29:04.2658151Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2025-12-04T09:29:04.2658448Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2025-12-04T09:29:04.2658728Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2025-12-04T09:29:04.2659019Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2025-12-04T09:29:04.2659309Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2025-12-04T09:29:04.2659596Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2025-12-04T09:29:04.2659883Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2025-12-04T09:29:04.2660168Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2025-12-04T09:29:04.2696246Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2025-12-04T09:29:04.2716362Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2025-12-04T09:29:04.2716684Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:29:04.2717123Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:29:04.2717376Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2025-12-04T09:29:04.2717608Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2025-12-04T09:29:04.2717838Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2025-12-04T09:29:04.2718078Z inflating: build/custom_test_artifacts/jit-hook-build/hipblaslt_test_outer_vec.cc 2025-12-04T09:29:04.2718324Z inflating: build/custom_test_artifacts/jit-hook-build/hipblaslt_test_vec_ext.cc 2025-12-04T09:29:04.2718545Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 2025-12-04T09:29:04.2718753Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2025-12-04T09:29:04.2718956Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2025-12-04T09:29:04.2721613Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2025-12-04T09:29:04.2727379Z creating: build/custom_test_artifacts/custom-backend-build/ 2025-12-04T09:29:04.2728160Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 2025-12-04T09:29:04.2728407Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:29:04.2728684Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:29:04.2728953Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/ 2025-12-04T09:29:04.2729245Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:29:04.2729527Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:29:04.2729799Z creating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:29:04.2730113Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:29:04.2730421Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:29:04.2730715Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:29:04.2730999Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:29:04.2731276Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:29:04.2731610Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:29:04.2732447Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:29:04.2733050Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:29:04.2733380Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:29:04.2733719Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:29:04.2734016Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:29:04.2734259Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:29:04.2734506Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2025-12-04T09:29:04.2735340Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 2025-12-04T09:29:04.2740019Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 
2025-12-04T09:29:04.2740901Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2025-12-04T09:29:04.2741219Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2025-12-04T09:29:04.2741520Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2025-12-04T09:29:04.2741833Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2025-12-04T09:29:04.2742151Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2025-12-04T09:29:04.2742466Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2025-12-04T09:29:04.2742775Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2025-12-04T09:29:04.2743089Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2025-12-04T09:29:04.2743422Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2025-12-04T09:29:04.2802203Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2025-12-04T09:29:04.2807930Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2025-12-04T09:29:04.2808693Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2025-12-04T09:29:04.2809051Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2025-12-04T09:29:04.2809410Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2025-12-04T09:29:04.2809730Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2025-12-04T09:29:04.2810066Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2025-12-04T09:29:04.2810399Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2025-12-04T09:29:04.2810731Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2025-12-04T09:29:04.2811066Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2025-12-04T09:29:04.2811390Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2025-12-04T09:29:04.2816566Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2025-12-04T09:29:04.2867090Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2025-12-04T09:29:04.2867434Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:29:04.2867739Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:29:04.2868018Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2025-12-04T09:29:04.2868273Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2025-12-04T09:29:04.2868521Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2025-12-04T09:29:04.2868783Z inflating: build/custom_test_artifacts/custom-backend-build/hipblaslt_test_outer_vec.cc 2025-12-04T09:29:04.2869236Z inflating: build/custom_test_artifacts/custom-backend-build/hipblaslt_test_vec_ext.cc 2025-12-04T09:29:04.2869479Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2025-12-04T09:29:04.2869699Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2025-12-04T09:29:04.2869997Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2025-12-04T09:29:04.2905073Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2025-12-04T09:29:04.2936414Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2025-12-04T09:29:04.2936596Z creating: build/lib/ 2025-12-04T09:29:04.2972177Z inflating: build/lib/libprotobuf-lite.a 2025-12-04T09:29:04.3216989Z inflating: build/lib/libprotobuf.a 2025-12-04T09:29:04.3491443Z inflating: build/lib/libprotoc.a 2025-12-04T09:29:04.3497638Z inflating: build/lib/libpthreadpool.a 2025-12-04T09:29:04.3502087Z inflating: build/lib/libcpuinfo.a 2025-12-04T09:29:04.3506554Z inflating: build/lib/libcpuinfo_internals.a 2025-12-04T09:29:04.3507458Z inflating: build/lib/libclog.a 2025-12-04T09:29:04.3518017Z inflating: build/lib/libpytorch_qnnpack.a 2025-12-04T09:29:04.3519243Z inflating: build/lib/libnnpack_reference_layers.a 2025-12-04T09:29:04.3529277Z inflating: build/lib/libnnpack.a 2025-12-04T09:29:04.3630786Z inflating: build/lib/libmicrokernels-prod.a 2025-12-04T09:29:04.4100095Z inflating: build/lib/libmicrokernels-all.a 2025-12-04T09:29:04.4138607Z inflating: build/lib/libgtest.a 2025-12-04T09:29:04.4148537Z inflating: build/lib/libgmock.a 2025-12-04T09:29:04.4148668Z inflating: build/lib/libgtest_main.a 2025-12-04T09:29:04.4148919Z inflating: build/lib/libgmock_main.a 2025-12-04T09:29:04.4198917Z inflating: build/lib/libXNNPACK.a 2025-12-04T09:29:04.4240326Z inflating: build/lib/libbenchmark.a 2025-12-04T09:29:04.4240736Z inflating: build/lib/libbenchmark_main.a 2025-12-04T09:29:04.4241149Z inflating: build/lib/libjitprofiling.a 2025-12-04T09:29:04.4245467Z inflating: build/lib/libittnotify.a 2025-12-04T09:29:04.4282084Z inflating: 
build/lib/libasmjit.a 2025-12-04T09:29:04.4907715Z inflating: build/lib/libfbgemm.a 2025-12-04T09:29:04.4923586Z inflating: build/lib/libtensorpipe_uv.a 2025-12-04T09:29:04.5219532Z inflating: build/lib/libtensorpipe.a 2025-12-04T09:29:04.5285543Z inflating: build/lib/libgloo.a 2025-12-04T09:29:04.5311422Z inflating: build/lib/libonnx_proto.a 2025-12-04T09:29:04.5534121Z inflating: build/lib/libgloo_hip.a 2025-12-04T09:29:04.5928375Z inflating: build/lib/libonnx.a 2025-12-04T09:29:05.1466767Z inflating: build/lib/libdnnl.a 2025-12-04T09:29:05.1478831Z inflating: build/lib/libfmt.a 2025-12-04T09:29:05.1648933Z inflating: build/lib/libkineto.a 2025-12-04T09:29:05.1713936Z inflating: build/lib/libc10.so 2025-12-04T09:29:05.1714779Z inflating: build/lib/libtorch_global_deps.so 2025-12-04T09:29:05.1715786Z inflating: build/lib/libcaffe2_nvrtc.so 2025-12-04T09:29:05.1741166Z inflating: build/lib/libc10_hip.so 2025-12-04T09:29:05.2015437Z inflating: build/lib/libfbgemm_genai.a 2025-12-04T09:29:06.8973367Z inflating: build/lib/libtorch_cpu.so 2025-12-04T09:29:06.8977850Z inflating: build/lib/libshm.so 2025-12-04T09:29:07.7280758Z inflating: build/lib/libtorch_hip.so 2025-12-04T09:29:07.7282600Z inflating: build/lib/libtorch.so 2025-12-04T09:29:07.7293310Z inflating: build/lib/libjitbackend_test.so 2025-12-04T09:29:07.7316755Z inflating: build/lib/libbackend_with_compiler.so 2025-12-04T09:29:07.7346298Z inflating: build/lib/libtorchbind_test.so 2025-12-04T09:29:07.7361246Z inflating: build/lib/libaoti_custom_ops.so 2025-12-04T09:29:07.8655313Z inflating: build/lib/libtorch_python.so 2025-12-04T09:29:07.8675745Z inflating: build/lib/libnnapi_backend.so 2025-12-04T09:29:07.8676386Z creating: build/bin/ 2025-12-04T09:29:07.8676512Z creating: build/bin/CMakeFiles/ 2025-12-04T09:29:07.8677169Z inflating: build/bin/cmake_install.cmake 2025-12-04T09:29:07.8677331Z inflating: build/bin/CTestTestfile.cmake 2025-12-04T09:29:07.8957385Z inflating: build/bin/protoc-3.13.0.0 
2025-12-04T09:29:07.9182361Z inflating: build/bin/protoc 2025-12-04T09:29:07.9215704Z inflating: build/bin/c10_AllocatorConfig_test 2025-12-04T09:29:07.9246522Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2025-12-04T09:29:07.9278096Z inflating: build/bin/c10_DeviceGuard_test 2025-12-04T09:29:07.9310057Z inflating: build/bin/c10_Device_test 2025-12-04T09:29:07.9346461Z inflating: build/bin/c10_DispatchKeySet_test 2025-12-04T09:29:07.9379685Z inflating: build/bin/c10_Scalar_test 2025-12-04T09:29:07.9409868Z inflating: build/bin/c10_StreamGuard_test 2025-12-04T09:29:07.9444736Z inflating: build/bin/c10_SymInt_test 2025-12-04T09:29:07.9478942Z inflating: build/bin/c10_SizesAndStrides_test 2025-12-04T09:29:07.9511828Z inflating: build/bin/c10_Bitset_test 2025-12-04T09:29:07.9554136Z inflating: build/bin/c10_cow_test 2025-12-04T09:29:07.9587296Z inflating: build/bin/c10_InlineDeviceGuard_test 2025-12-04T09:29:07.9621540Z inflating: build/bin/c10_InlineStreamGuard_test 2025-12-04T09:29:07.9652168Z inflating: build/bin/c10_ArrayRef_test 2025-12-04T09:29:07.9682834Z inflating: build/bin/c10_ConstexprCrc_test 2025-12-04T09:29:07.9714160Z inflating: build/bin/c10_DeadlockDetection_test 2025-12-04T09:29:07.9746882Z inflating: build/bin/c10_IntrusiveList_test 2025-12-04T09:29:07.9776996Z inflating: build/bin/c10_Half_test 2025-12-04T09:29:07.9826471Z inflating: build/bin/c10_Enumerate_test 2025-12-04T09:29:07.9846257Z inflating: build/bin/c10_LeftRight_test 2025-12-04T09:29:07.9896967Z inflating: build/bin/c10_NetworkFlow_test 2025-12-04T09:29:07.9918242Z inflating: build/bin/c10_Semaphore_test 2025-12-04T09:29:07.9940897Z inflating: build/bin/c10_Synchronized_test 2025-12-04T09:29:07.9972536Z inflating: build/bin/c10_TypeIndex_test 2025-12-04T09:29:08.0019785Z inflating: build/bin/c10_ThreadLocal_test 2025-12-04T09:29:08.0041585Z inflating: build/bin/c10_accumulate_test 2025-12-04T09:29:08.0096392Z inflating: build/bin/c10_bfloat16_test 
2025-12-04T09:29:08.0103143Z inflating: build/bin/c10_error_test 2025-12-04T09:29:08.0134148Z inflating: build/bin/c10_bit_cast_test 2025-12-04T09:29:08.0168013Z inflating: build/bin/c10_complex_test 2025-12-04T09:29:08.0200449Z inflating: build/bin/c10_exception_test 2025-12-04T09:29:08.0235386Z inflating: build/bin/c10_complex_math_test 2025-12-04T09:29:08.0266607Z inflating: build/bin/c10_flags_test 2025-12-04T09:29:08.0298562Z inflating: build/bin/c10_irange_test 2025-12-04T09:29:08.0329867Z inflating: build/bin/c10_generic_math_test 2025-12-04T09:29:08.0419431Z inflating: build/bin/c10_intrusive_ptr_test 2025-12-04T09:29:08.0454686Z inflating: build/bin/c10_logging_test 2025-12-04T09:29:08.0485240Z inflating: build/bin/c10_nofatal_test 2025-12-04T09:29:08.0518249Z inflating: build/bin/c10_lazy_test 2025-12-04T09:29:08.0556369Z inflating: build/bin/c10_ordered_preserving_dict_test 2025-12-04T09:29:08.0588893Z inflating: build/bin/c10_registry_test 2025-12-04T09:29:08.0621161Z inflating: build/bin/c10_ssize_test 2025-12-04T09:29:08.0666280Z inflating: build/bin/c10_optional_test 2025-12-04T09:29:08.0766477Z inflating: build/bin/c10_small_vector_test 2025-12-04T09:29:08.0788180Z inflating: build/bin/c10_string_util_test 2025-12-04T09:29:08.0845677Z inflating: build/bin/c10_tempfile_test 2025-12-04T09:29:08.0857594Z inflating: build/bin/c10_string_view_test 2025-12-04T09:29:08.0886727Z inflating: build/bin/c10_intrusive_ptr_benchmark 2025-12-04T09:29:08.0926405Z inflating: build/bin/c10_typeid_test 2025-12-04T09:29:08.0940707Z inflating: build/bin/c10_hip_HIPAssertionsTest_1_var_test 2025-12-04T09:29:08.0986097Z inflating: build/bin/c10_hip_HIPAssertionsTest_catches_stream 2025-12-04T09:29:08.1006516Z inflating: build/bin/c10_hip_HIPAssertionsTest_catches_thread_and_block_and_device 2025-12-04T09:29:08.1030945Z inflating: build/bin/c10_hip_HIPAssertionsTest_from_2_processes 2025-12-04T09:29:08.1067269Z inflating: 
build/bin/c10_hip_HIPAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T09:29:08.1115541Z inflating: build/bin/c10_hip_HIPAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T09:29:08.1156244Z inflating: build/bin/c10_hip_HIPAssertionsTest_multiple_writes_from_same_block 2025-12-04T09:29:08.1156443Z inflating: build/bin/c10_hip_HIPTest 2025-12-04T09:29:08.1478953Z inflating: build/bin/vec_test_all_types_DEFAULT 2025-12-04T09:29:08.1816528Z inflating: build/bin/vec_test_all_types_AVX512 2025-12-04T09:29:08.2156329Z inflating: build/bin/vec_test_all_types_AVX2 2025-12-04T09:29:08.2213521Z inflating: build/bin/test_aoti_abi_check 2025-12-04T09:29:08.2243883Z inflating: build/bin/test_vec_half_DEFAULT 2025-12-04T09:29:08.2275104Z inflating: build/bin/test_vec_half_AVX2 2025-12-04T09:29:08.2305826Z inflating: build/bin/test_vec_half_AVX512 2025-12-04T09:29:08.2342942Z inflating: build/bin/BackoffTest 2025-12-04T09:29:08.2370612Z inflating: build/bin/FileStoreTest 2025-12-04T09:29:08.2405303Z inflating: build/bin/TCPStoreTest 2025-12-04T09:29:08.2464487Z inflating: build/bin/HashStoreTest 2025-12-04T09:29:08.2478431Z inflating: build/bin/ProcessGroupGlooTest 2025-12-04T09:29:08.2493113Z inflating: build/bin/example_allreduce 2025-12-04T09:29:08.2493262Z inflating: build/bin/torch_shm_manager 2025-12-04T09:29:08.2515348Z inflating: build/bin/static_runtime_bench 2025-12-04T09:29:08.2686124Z inflating: build/bin/static_runtime_test 2025-12-04T09:29:08.2702882Z inflating: build/bin/Dict_test 2025-12-04T09:29:08.2735153Z inflating: build/bin/Dimname_test 2025-12-04T09:29:08.2774380Z inflating: build/bin/MaybeOwned_test 2025-12-04T09:29:08.2809170Z inflating: build/bin/NamedTensor_test 2025-12-04T09:29:08.2845067Z inflating: build/bin/apply_utils_test 2025-12-04T09:29:08.2880981Z inflating: build/bin/atest 2025-12-04T09:29:08.2919818Z inflating: build/bin/basic 2025-12-04T09:29:08.2953091Z inflating: build/bin/broadcast_test 
2025-12-04T09:29:08.2984465Z inflating: build/bin/cpu_allocator_test 2025-12-04T09:29:08.3019884Z inflating: build/bin/cpu_generator_test 2025-12-04T09:29:08.3052129Z inflating: build/bin/cpu_profiling_allocator_test 2025-12-04T09:29:08.3107316Z inflating: build/bin/cpu_rng_test 2025-12-04T09:29:08.3139177Z inflating: build/bin/dlconvertor_test 2025-12-04T09:29:08.3174177Z inflating: build/bin/extension_backend_test 2025-12-04T09:29:08.3208152Z inflating: build/bin/half_test 2025-12-04T09:29:08.3265892Z inflating: build/bin/ivalue_test 2025-12-04T09:29:08.3296606Z inflating: build/bin/lazy_tensor_test 2025-12-04T09:29:08.3329250Z inflating: build/bin/math_kernel_test 2025-12-04T09:29:08.3361482Z inflating: build/bin/memory_format_test 2025-12-04T09:29:08.3394205Z inflating: build/bin/memory_overlapping_test 2025-12-04T09:29:08.3426952Z inflating: build/bin/mobile_memory_cleanup 2025-12-04T09:29:08.3461892Z inflating: build/bin/native_test 2025-12-04T09:29:08.3493551Z inflating: build/bin/operator_name_test 2025-12-04T09:29:08.3524999Z inflating: build/bin/operators_test 2025-12-04T09:29:08.3557081Z inflating: build/bin/packedtensoraccessor_test 2025-12-04T09:29:08.3598117Z inflating: build/bin/pow_test 2025-12-04T09:29:08.3633023Z inflating: build/bin/quantized_test 2025-12-04T09:29:08.3663978Z inflating: build/bin/reduce_ops_test 2025-12-04T09:29:08.3695438Z inflating: build/bin/reportMemoryUsage_test 2025-12-04T09:29:08.3729420Z inflating: build/bin/scalar_tensor_test 2025-12-04T09:29:08.3764836Z inflating: build/bin/scalar_test 2025-12-04T09:29:08.3796804Z inflating: build/bin/StorageUtils_test 2025-12-04T09:29:08.3828869Z inflating: build/bin/stride_properties_test 2025-12-04T09:29:08.3876366Z inflating: build/bin/tensor_iterator_test 2025-12-04T09:29:08.3909897Z inflating: build/bin/test_parallel 2025-12-04T09:29:08.3941499Z inflating: build/bin/thread_init_test 2025-12-04T09:29:08.3975194Z inflating: build/bin/type_ptr_test 2025-12-04T09:29:08.4011767Z 
inflating: build/bin/type_test 2025-12-04T09:29:08.4043942Z inflating: build/bin/undefined_tensor_test 2025-12-04T09:29:08.4075055Z inflating: build/bin/verify_api_visibility 2025-12-04T09:29:08.4118125Z inflating: build/bin/legacy_vmap_test 2025-12-04T09:29:08.4149867Z inflating: build/bin/weakref_test 2025-12-04T09:29:08.4181781Z inflating: build/bin/wrapdim_test 2025-12-04T09:29:08.4243650Z inflating: build/bin/List_test 2025-12-04T09:29:08.4275523Z inflating: build/bin/xla_tensor_test 2025-12-04T09:29:08.4311813Z inflating: build/bin/IListRef_test 2025-12-04T09:29:08.4382012Z inflating: build/bin/kernel_function_legacy_test 2025-12-04T09:29:08.4422093Z inflating: build/bin/KernelFunction_test 2025-12-04T09:29:08.4478902Z inflating: build/bin/kernel_function_test 2025-12-04T09:29:08.4565541Z inflating: build/bin/kernel_lambda_legacy_test 2025-12-04T09:29:08.4612347Z inflating: build/bin/kernel_lambda_test 2025-12-04T09:29:08.4648810Z inflating: build/bin/kernel_stackbased_test 2025-12-04T09:29:08.4705337Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2025-12-04T09:29:08.4737018Z inflating: build/bin/CppSignature_test 2025-12-04T09:29:08.4767353Z inflating: build/bin/op_allowlist_test 2025-12-04T09:29:08.4944175Z inflating: build/bin/op_registration_test 2025-12-04T09:29:08.4974602Z inflating: build/bin/hip_complex_math_test 2025-12-04T09:29:08.5008200Z inflating: build/bin/backend_fallback_test 2025-12-04T09:29:08.5038469Z inflating: build/bin/hip_complex_test 2025-12-04T09:29:08.5078957Z inflating: build/bin/inline_container_test 2025-12-04T09:29:08.5111379Z inflating: build/bin/hip_apply_test 2025-12-04T09:29:08.5141729Z inflating: build/bin/hip_distributions_test 2025-12-04T09:29:08.5172029Z inflating: build/bin/hip_generator_test 2025-12-04T09:29:08.5202732Z inflating: build/bin/hip_half_test 2025-12-04T09:29:08.5232846Z inflating: build/bin/hip_integer_divider_test 2025-12-04T09:29:08.5262976Z inflating: build/bin/hip_optional_test 
2025-12-04T09:29:08.5293250Z inflating: build/bin/hip_packedtensoraccessor_test 2025-12-04T09:29:08.5323718Z inflating: build/bin/hip_vectorized_test 2025-12-04T09:29:08.5355531Z inflating: build/bin/hip_dlconvertor_test 2025-12-04T09:29:08.5975430Z inflating: build/bin/test_jit 2025-12-04T09:29:08.6175181Z inflating: build/bin/test_lazy 2025-12-04T09:29:08.6208622Z inflating: build/bin/test_dist_autograd 2025-12-04T09:29:08.6250067Z inflating: build/bin/test_cpp_rpc 2025-12-04T09:29:08.6251675Z inflating: build/bin/parallel_benchmark 2025-12-04T09:29:08.6909860Z inflating: build/bin/test_api 2025-12-04T09:29:08.6910087Z creating: .additional_ci_files/ 2025-12-04T09:29:08.6945269Z inflating: .additional_ci_files/test-times.json 2025-12-04T09:29:08.7077281Z inflating: .additional_ci_files/test-class-times.json 2025-12-04T09:29:08.7104644Z ##[group]Run rm artifacts.zip 2025-12-04T09:29:08.7105024Z rm artifacts.zip 2025-12-04T09:29:08.7111036Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:08.7111204Z env: 2025-12-04T09:29:08.7111314Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:08.7111461Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:08.7111648Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:08.7111824Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:08.7112378Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:08.7112762Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:08.7112885Z AWS_REGION: us-east-1 2025-12-04T09:29:08.7113101Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:08.7113406Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:08.7115413Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:08.7115528Z ##[endgroup] 2025-12-04T09:29:08.8071139Z ##[group]Run df -H 
2025-12-04T09:29:08.8071262Z df -H 2025-12-04T09:29:08.8076134Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:08.8076288Z env: 2025-12-04T09:29:08.8076384Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:08.8076524Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:08.8076703Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:08.8076880Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:08.8077261Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:08.8077629Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:08.8077747Z AWS_REGION: us-east-1 2025-12-04T09:29:08.8077930Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:08.8078082Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:08.8080081Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:08.8080186Z ##[endgroup] 2025-12-04T09:29:08.8658481Z Filesystem Size Used Avail Use% Mounted on 2025-12-04T09:29:08.8667864Z overlay 16T 820G 15T 6% / 2025-12-04T09:29:08.8668098Z tmpfs 68M 0 68M 0% /dev 2025-12-04T09:29:08.8668349Z /dev/md0 16T 820G 15T 6% /run 2025-12-04T09:29:08.8668506Z shm 68M 4.1k 68M 1% /dev/shm 2025-12-04T09:29:08.8668695Z amdprj2-k8s_2 5.5T 120G 5.4T 3% /home/runner/pytorch-data 2025-12-04T09:29:08.8668889Z tmpfs 3.3T 13k 3.3T 1% /run/secrets/kubernetes.io/serviceaccount 2025-12-04T09:29:08.8669055Z tmpfs 1.7T 0 1.7T 0% /proc/acpi 2025-12-04T09:29:08.8669195Z tmpfs 1.7T 0 1.7T 0% /proc/scsi 2025-12-04T09:29:08.8669333Z tmpfs 1.7T 0 1.7T 0% /sys/firmware 2025-12-04T09:29:08.8669500Z tmpfs 1.7T 0 1.7T 0% /sys/devices/virtual/powercap 2025-12-04T09:29:08.8689951Z Prepare all required actions 2025-12-04T09:29:08.8690163Z Getting action download info 2025-12-04T09:29:09.2372069Z ##[group]Run ./.github/actions/download-td-artifacts 
2025-12-04T09:29:09.2372229Z with: 2025-12-04T09:29:09.2372326Z env: 2025-12-04T09:29:09.2372428Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:09.2372571Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:09.2372754Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:09.2372941Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:09.2373335Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:09.2373714Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:09.2373836Z AWS_REGION: us-east-1 2025-12-04T09:29:09.2374012Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:09.2374184Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:09.2376260Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:09.2376371Z ##[endgroup] 2025-12-04T09:29:09.2401864Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:29:09.2402009Z with: 2025-12-04T09:29:09.2402102Z name: td_results 2025-12-04T09:29:09.2402203Z s3-bucket: gha-artifacts 2025-12-04T09:29:09.2402311Z region: us-east-1 2025-12-04T09:29:09.2402406Z env: 2025-12-04T09:29:09.2402500Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:09.2402772Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:09.2402953Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:09.2403119Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:09.2403502Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:09.2403876Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:09.2403991Z AWS_REGION: us-east-1 2025-12-04T09:29:09.2404145Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:09.2404307Z 
AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:09.2406405Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:09.2406508Z ##[endgroup] 2025-12-04T09:29:09.4843455Z (node:17265) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:29:09.4843667Z 2025-12-04T09:29:09.4843755Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T09:29:09.4843970Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:29:09.4844183Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:29:09.7496999Z Found 1 objects with prefix pytorch/pytorch/19922849170/td_results/ 2025-12-04T09:29:09.7516109Z Starting download (1/1): /home/runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:29:10.1742031Z Finished download (1/1): /home/runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:29:10.1742302Z Artifact download has finished successfully 2025-12-04T09:29:10.2071656Z ##[group]Run mkdir -p .additional_ci_files 2025-12-04T09:29:10.2071831Z mkdir -p .additional_ci_files 2025-12-04T09:29:10.2072005Z mv td_results.json .additional_ci_files/td_results.json || true 2025-12-04T09:29:10.2076779Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:10.2076955Z env: 2025-12-04T09:29:10.2077057Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:10.2077200Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:10.2077382Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:10.2077555Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:10.2078129Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:10.2078505Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:10.2078630Z AWS_REGION: us-east-1 
2025-12-04T09:29:10.2078883Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:10.2079045Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:10.2081059Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:10.2081174Z ##[endgroup] 2025-12-04T09:29:10.2143844Z ##[group]Run .github/scripts/parse_ref.py 2025-12-04T09:29:10.2144021Z .github/scripts/parse_ref.py 2025-12-04T09:29:10.2150499Z shell: /usr/bin/bash -e {0} 2025-12-04T09:29:10.2150619Z env: 2025-12-04T09:29:10.2150718Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:10.2150861Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:10.2151042Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:10.2151215Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:10.2151608Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:10.2151975Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:10.2152095Z AWS_REGION: us-east-1 2025-12-04T09:29:10.2152267Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:10.2152443Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:10.2154448Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:10.2154562Z ##[endgroup] 2025-12-04T09:29:10.2273836Z Setting output branch=main 2025-12-04T09:29:10.2344120Z Prepare all required actions 2025-12-04T09:29:10.2344350Z Getting action download info 2025-12-04T09:29:10.4715655Z ##[group]Run ./.github/actions/filter-test-configs 2025-12-04T09:29:10.4715815Z with: 2025-12-04T09:29:10.4716191Z github-token: *** 2025-12-04T09:29:10.4719152Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": 
"unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", 
"mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}]} 2025-12-04T09:29:10.4722338Z job-name: linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:29:10.4722565Z env: 2025-12-04T09:29:10.4722666Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:10.4722810Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:10.4722992Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:10.4723163Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:10.4723548Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:10.4723919Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:10.4724180Z AWS_REGION: us-east-1 2025-12-04T09:29:10.4724466Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:10.4724624Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:10.4726665Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:10.4726777Z ##[endgroup] 2025-12-04T09:29:10.4743167Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T09:29:10.4743296Z with: 2025-12-04T09:29:10.4743385Z shell: bash 2025-12-04T09:29:10.4743485Z timeout_minutes: 10 2025-12-04T09:29:10.4743590Z max_attempts: 5 2025-12-04T09:29:10.4743693Z retry_wait_seconds: 30 2025-12-04T09:29:10.4743991Z command: set -eux # PyYAML 6.0 doesn't work with MacOS x86 anymore # This 
must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2 python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:29:10.4744297Z polling_interval_seconds: 1 2025-12-04T09:29:10.4744417Z warning_on_retry: true 2025-12-04T09:29:10.4744528Z continue_on_error: false 2025-12-04T09:29:10.4744640Z env: 2025-12-04T09:29:10.4744738Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:10.4744888Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:10.4745072Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:10.4745243Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:10.4745631Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:10.4746107Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:10.4746229Z AWS_REGION: us-east-1 2025-12-04T09:29:10.4746369Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:10.4746521Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:10.4748547Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:10.4748706Z GITHUB_TOKEN: *** 2025-12-04T09:29:10.4748807Z ##[endgroup] 2025-12-04T09:29:10.5145620Z + python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:29:10.6600092Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:29:10.7574248Z Collecting requests==2.27.1 2025-12-04T09:29:10.7936001Z Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB) 2025-12-04T09:29:10.8051461Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.1/63.1 KB 6.3 MB/s eta 0:00:00 2025-12-04T09:29:10.8537004Z Collecting pyyaml==6.0.2 2025-12-04T09:29:10.8578411Z Downloading PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB) 2025-12-04T09:29:10.9021625Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 751.2/751.2 KB 17.4 MB/s eta 0:00:00 2025-12-04T09:29:11.0007321Z 
Collecting charset-normalizer~=2.0.0 2025-12-04T09:29:11.0062544Z Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB) 2025-12-04T09:29:11.0246801Z Collecting idna<4,>=2.5 2025-12-04T09:29:11.0306751Z Downloading idna-3.11-py3-none-any.whl (71 kB) 2025-12-04T09:29:11.0355125Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.0/71.0 KB 13.8 MB/s eta 0:00:00 2025-12-04T09:29:11.0613876Z Collecting urllib3<1.27,>=1.21.1 2025-12-04T09:29:11.0672388Z Downloading urllib3-1.26.20-py2.py3-none-any.whl (144 kB) 2025-12-04T09:29:11.0738677Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 144.2/144.2 KB 26.1 MB/s eta 0:00:00 2025-12-04T09:29:11.0976843Z Collecting certifi>=2017.4.17 2025-12-04T09:29:11.1016780Z Downloading certifi-2025.11.12-py3-none-any.whl (159 kB) 2025-12-04T09:29:11.1087056Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.4/159.4 KB 28.4 MB/s eta 0:00:00 2025-12-04T09:29:11.1584070Z Installing collected packages: urllib3, pyyaml, idna, charset-normalizer, certifi, requests 2025-12-04T09:29:11.2545901Z WARNING: The script normalizer is installed in '/home/runner/.local/bin' which is not on PATH. 2025-12-04T09:29:11.2546765Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T09:29:11.2806721Z Successfully installed certifi-2025.11.12 charset-normalizer-2.0.12 idna-3.11 pyyaml-6.0.2 requests-2.27.1 urllib3-1.26.20 2025-12-04T09:29:11.5144013Z Command completed after 1 attempt(s). 
2025-12-04T09:29:11.5240401Z ##[group]Run set -x 2025-12-04T09:29:11.5240530Z set -x 2025-12-04T09:29:11.5240628Z  2025-12-04T09:29:11.5240814Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:29:11.5241007Z # in runner workspace 2025-12-04T09:29:11.5241170Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2025-12-04T09:29:11.5245444Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:11.5245608Z env: 2025-12-04T09:29:11.5245713Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:11.5245858Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:11.5246128Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:11.5246306Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:11.5246708Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:11.5247106Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:11.5247231Z AWS_REGION: us-east-1 2025-12-04T09:29:11.5247398Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:11.5247554Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:11.5249572Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:11.5249686Z ##[endgroup] 2025-12-04T09:29:11.5309714Z + python3 /home/runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2025-12-04T09:29:11.5372905Z Setting output branch=main 2025-12-04T09:29:11.5396000Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:29:11.5396204Z echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:29:11.5396349Z echo "Job name: ${JOB_NAME}" 2025-12-04T09:29:11.5396476Z  2025-12-04T09:29:11.5396639Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:29:11.5396859Z # in runner workspace 2025-12-04T09:29:11.5397033Z python3 
"${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2025-12-04T09:29:11.5397222Z  --workflow "${GITHUB_WORKFLOW}" \ 2025-12-04T09:29:11.5397362Z  --job-name "${JOB_NAME}" \ 2025-12-04T09:29:11.5400522Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 6, 
"num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}]}" \ 2025-12-04T09:29:11.5403751Z  --selected-test-configs "" \ 2025-12-04T09:29:11.5403890Z  --pr-number "${PR_NUMBER}" \ 2025-12-04T09:29:11.5404026Z  --tag "${TAG}" \ 2025-12-04T09:29:11.5404149Z  --event-name "${EVENT_NAME}" \ 2025-12-04T09:29:11.5404280Z  --schedule "${SCHEDULE}" \ 2025-12-04T09:29:11.5404413Z  --branch "${HEAD_BRANCH}" 2025-12-04T09:29:11.5409237Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:11.5409391Z env: 2025-12-04T09:29:11.5409494Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:11.5409639Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:11.5409830Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:11.5410005Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:11.5410407Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add 
video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:11.5410782Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:11.5410908Z AWS_REGION: us-east-1 2025-12-04T09:29:11.5411097Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:11.5411253Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:11.5413274Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:11.5413495Z GITHUB_TOKEN: *** 2025-12-04T09:29:11.5413701Z JOB_NAME: linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:29:11.5413923Z PR_NUMBER: 2025-12-04T09:29:11.5414022Z TAG: 2025-12-04T09:29:11.5414124Z EVENT_NAME: schedule 2025-12-04T09:29:11.5414235Z SCHEDULE: 29 8 * * * 2025-12-04T09:29:11.5414342Z HEAD_BRANCH: main 2025-12-04T09:29:11.5414455Z ##[endgroup] 2025-12-04T09:29:11.5439058Z Workflow: trunk-rocm-mi300 2025-12-04T09:29:11.5439410Z Job name: linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:29:12.1175135Z INFO:root:Issue https://github.com/pytorch/pytorch/issues/167616 created by jithunnair-amd has unstable all the test jobs for trunk-rocm-mi300 / linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:29:12.3985809Z Setting output keep-going=True 2025-12-04T09:29:12.3986559Z Setting output ci-verbose-test-logs=False 2025-12-04T09:29:12.3987075Z Setting output ci-test-showlocals=False 2025-12-04T09:29:12.3987524Z Setting output ci-no-test-timeout=False 2025-12-04T09:29:12.3987958Z Setting output ci-no-td=False 2025-12-04T09:29:12.3988361Z Setting output ci-td-distributed=False 2025-12-04T09:29:12.3988787Z Setting output is-unstable=True 2025-12-04T09:29:12.3989185Z Setting output reenabled-issues= 2025-12-04T09:29:12.4013036Z Setting output test-matrix={"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": 
"linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": 
"linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": 
"rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", 
"unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}]} 2025-12-04T09:29:12.4036043Z Setting output is-test-matrix-empty=False 2025-12-04T09:29:12.4166823Z ##[group]Run echo "Filtered matrix:" 2025-12-04T09:29:12.4167330Z echo "Filtered matrix:" 2025-12-04T09:29:12.4187454Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 6, 
"runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": 
"rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 6, "runner": "linux.rocm.gpu.gfx942.1.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 1, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", 
"rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 2, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "mem_leak_check": "mem_leak_check", "unstable": "unstable", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable", "mem_leak_check": "mem_leak_check"}, {"config": "distributed", "shard": 3, "num_shards": 3, "runner": "linux.rocm.gpu.gfx942.4.b", "rerun_disabled_tests": "rerun_disabled_tests", "unstable": "unstable"}]}" 2025-12-04T09:29:12.4201821Z  2025-12-04T09:29:12.4201998Z echo 2025-12-04T09:29:12.4202223Z echo "Is the current job unstable? True" 2025-12-04T09:29:12.4202481Z  2025-12-04T09:29:12.4202648Z echo 2025-12-04T09:29:12.4202860Z echo "Is keep-going label set? True" 2025-12-04T09:29:12.4203109Z  2025-12-04T09:29:12.4203272Z echo 2025-12-04T09:29:12.4203466Z echo "Reenabled issues? 
" 2025-12-04T09:29:12.4210446Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:12.4210745Z env: 2025-12-04T09:29:12.4210941Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:12.4211255Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:12.4211614Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:12.4211949Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:12.4212743Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:12.4213517Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:12.4213751Z AWS_REGION: us-east-1 2025-12-04T09:29:12.4214042Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:12.4214349Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:12.4218600Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:12.4218811Z ##[endgroup] 2025-12-04T09:29:12.4240008Z Filtered matrix: 2025-12-04T09:29:12.4263088Z {include: [{config: default, shard: 1, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: default, shard: 1, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 1, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: default, shard: 2, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: default, shard: 2, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: 
rerun_disabled_tests}, {config: default, shard: 2, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: default, shard: 3, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: default, shard: 3, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: default, shard: 4, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: default, shard: 4, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: default, shard: 5, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: default, shard: 5, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: 
rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: default, shard: 6, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: default, shard: 6, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 6, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: default, shard: 6, num_shards: 6, runner: linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: distributed, shard: 1, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: distributed, shard: 1, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 1, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: distributed, shard: 1, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: distributed, shard: 2, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: distributed, shard: 2, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 2, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: distributed, 
shard: 2, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}, {config: distributed, shard: 3, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, mem_leak_check: mem_leak_check, unstable: unstable}, {config: distributed, shard: 3, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, mem_leak_check: mem_leak_check, unstable: unstable, rerun_disabled_tests: rerun_disabled_tests}, {config: distributed, shard: 3, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable, mem_leak_check: mem_leak_check}, {config: distributed, shard: 3, num_shards: 3, runner: linux.rocm.gpu.gfx942.4.b, rerun_disabled_tests: rerun_disabled_tests, unstable: unstable}]} 2025-12-04T09:29:12.4276647Z 2025-12-04T09:29:12.4277099Z Is the current job unstable? True 2025-12-04T09:29:12.4277201Z 2025-12-04T09:29:12.4277257Z Is keep-going label set? True 2025-12-04T09:29:12.4277339Z 2025-12-04T09:29:12.4277383Z Reenabled issues? 
2025-12-04T09:29:12.4333644Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:29:12.4334312Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:29:12.4343923Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:12.4344398Z env: 2025-12-04T09:29:12.4344703Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:12.4345135Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:12.4345707Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:12.4346314Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:12.4347574Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:12.4348817Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:12.4349191Z AWS_REGION: us-east-1 2025-12-04T09:29:12.4349688Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:12.4350269Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:12.4357006Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:12.4357352Z JOB_TIMEOUT: 300 2025-12-04T09:29:12.4357664Z ##[endgroup] 2025-12-04T09:29:12.4422486Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:29:12.4423145Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:29:12.4423731Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:29:12.4431946Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:29:12.4432454Z env: 2025-12-04T09:29:12.4432764Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:12.4433211Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:12.4433812Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:12.4434355Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:12.4435631Z GPU_FLAG: --device=/dev/mem 
--device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:12.4436956Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:12.4437330Z AWS_REGION: us-east-1 2025-12-04T09:29:12.4437765Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:12.4438252Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:12.4444967Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:12.4445309Z ##[endgroup] 2025-12-04T09:29:12.4575541Z ##[group]Run set -x 2025-12-04T09:29:12.4576062Z set -x 2025-12-04T09:29:12.4576363Z  2025-12-04T09:29:12.4576708Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2025-12-04T09:29:12.4577250Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2025-12-04T09:29:12.4577754Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2025-12-04T09:29:12.4578223Z  TEST_COMMAND=.ci/caffe2/test.sh 2025-12-04T09:29:12.4578613Z else 2025-12-04T09:29:12.4578944Z  TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:29:12.4579336Z fi 2025-12-04T09:29:12.4579609Z  2025-12-04T09:29:12.4580043Z # detached container should get cleaned up by teardown_ec2_linux 2025-12-04T09:29:12.4580695Z # TODO: Stop building test binaries as part of the build phase 2025-12-04T09:29:12.4581277Z # Used for GPU_FLAG since that doesn't play nice 2025-12-04T09:29:12.4581794Z # shellcheck disable=SC2086,SC2090 2025-12-04T09:29:12.4582225Z container_name=$(docker run \ 2025-12-04T09:29:12.4582632Z  ${GPU_FLAG:-} \ 2025-12-04T09:29:12.4583013Z  -e BUILD_ENVIRONMENT \ 2025-12-04T09:29:12.4583625Z  -e PR_NUMBER \ 2025-12-04T09:29:12.4583985Z  -e GITHUB_ACTIONS \ 2025-12-04T09:29:12.4584360Z  -e GITHUB_REPOSITORY \ 2025-12-04T09:29:12.4584745Z  -e GITHUB_WORKFLOW \ 2025-12-04T09:29:12.4585115Z  -e GITHUB_JOB \ 2025-12-04T09:29:12.4585470Z  -e GITHUB_RUN_ID \ 2025-12-04T09:29:12.4585830Z  -e GITHUB_RUN_NUMBER \ 2025-12-04T09:29:12.4586309Z  -e GITHUB_RUN_ATTEMPT \ 2025-12-04T09:29:12.4586692Z  -e JOB_ID \ 
2025-12-04T09:29:12.4587031Z  -e JOB_NAME \ 2025-12-04T09:29:12.4587378Z  -e BASE_SHA \ 2025-12-04T09:29:12.4587706Z  -e BRANCH \ 2025-12-04T09:29:12.4588035Z  -e SHA1 \ 2025-12-04T09:29:12.4588369Z  -e AWS_DEFAULT_REGION \ 2025-12-04T09:29:12.4588757Z  -e IN_WHEEL_TEST \ 2025-12-04T09:29:12.4589116Z  -e SHARD_NUMBER \ 2025-12-04T09:29:12.4589478Z  -e TEST_CONFIG \ 2025-12-04T09:29:12.4589845Z  -e NUM_TEST_SHARDS \ 2025-12-04T09:29:12.4590228Z  -e REENABLED_ISSUES \ 2025-12-04T09:29:12.4590622Z  -e CONTINUE_THROUGH_ERROR \ 2025-12-04T09:29:12.4591024Z  -e VERBOSE_TEST_LOGS \ 2025-12-04T09:29:12.4591403Z  -e TEST_SHOWLOCALS \ 2025-12-04T09:29:12.4591774Z  -e NO_TEST_TIMEOUT \ 2025-12-04T09:29:12.4592130Z  -e NO_TD \ 2025-12-04T09:29:12.4592502Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2025-12-04T09:29:12.4592956Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2025-12-04T09:29:12.4593418Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2025-12-04T09:29:12.4593853Z  -e TESTS_TO_INCLUDE \ 2025-12-04T09:29:12.4594244Z  -e HUGGING_FACE_HUB_TOKEN \ 2025-12-04T09:29:12.4594645Z  -e DASHBOARD_TAG \ 2025-12-04T09:29:12.4595116Z  --env-file="${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T09:29:12.4595638Z  --ulimit stack=10485760:83886080 \ 2025-12-04T09:29:12.4596124Z  --ulimit core=0 \ 2025-12-04T09:29:12.4596559Z  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T09:29:12.4597061Z  --security-opt seccomp=unconfined \ 2025-12-04T09:29:12.4597504Z  --cap-add=SYS_PTRACE \ 2025-12-04T09:29:12.4597889Z  --shm-size="8g" \ 2025-12-04T09:29:12.4598240Z  --tty \ 2025-12-04T09:29:12.4598556Z  --detach \ 2025-12-04T09:29:12.4599405Z  --name="${container_name}" \ 2025-12-04T09:29:12.4599800Z  --user jenkins \ 2025-12-04T09:29:12.4600287Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 2025-12-04T09:29:12.4600791Z  -w /var/lib/jenkins/workspace \ 2025-12-04T09:29:12.4601380Z  "${DOCKER_IMAGE}" 2025-12-04T09:29:12.4601722Z ) 2025-12-04T09:29:12.4602049Z # save container name for 
later step 2025-12-04T09:29:12.4602569Z echo "CONTAINER_NAME=${container_name}" >> "$GITHUB_ENV" 2025-12-04T09:29:12.4603498Z # jenkins user does not have write permission to mounted workspace; work-around by copying within container to jenkins home 2025-12-04T09:29:12.4604634Z docker exec -t "${container_name}" sh -c "cd .. && cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && ${TEST_COMMAND}" 2025-12-04T09:29:12.4614755Z shell: /usr/bin/bash -e {0} 2025-12-04T09:29:12.4615112Z env: 2025-12-04T09:29:12.4615408Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:29:12.4615845Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T09:29:12.4616491Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T09:29:12.4617020Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T09:29:12.4618280Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T09:29:12.4619587Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:29:12.4619950Z AWS_REGION: us-east-1 2025-12-04T09:29:12.4620401Z AWS_ACCESS_KEY_ID: *** 2025-12-04T09:29:12.4620887Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T09:29:12.4627636Z AWS_SESSION_TOKEN: *** 2025-12-04T09:29:12.4628021Z BUILD_ENVIRONMENT: linux-jammy-rocm-py3.10 2025-12-04T09:29:12.4628428Z PR_NUMBER: 2025-12-04T09:29:12.4628752Z GITHUB_REPOSITORY: pytorch/pytorch 2025-12-04T09:29:12.4629160Z GITHUB_WORKFLOW: trunk-rocm-mi300 2025-12-04T09:29:12.4629536Z GITHUB_JOB: test 2025-12-04T09:29:12.4629854Z GITHUB_RUN_ID: 19922849170 2025-12-04T09:29:12.4630205Z GITHUB_RUN_NUMBER: 689 2025-12-04T09:29:12.4630532Z GITHUB_RUN_ATTEMPT: 1 2025-12-04T09:29:12.4630849Z JOB_ID: 57116213184 2025-12-04T09:29:12.4631503Z JOB_NAME: linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 
2025-12-04T09:29:12.4632186Z BRANCH: main 2025-12-04T09:29:12.4632538Z SHA1: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:12.4633035Z BASE_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:12.4633462Z TEST_CONFIG: default 2025-12-04T09:29:12.4633772Z SHARD_NUMBER: 6 2025-12-04T09:29:12.4634070Z NUM_TEST_SHARDS: 6 2025-12-04T09:29:12.4634377Z REENABLED_ISSUES: 2025-12-04T09:29:12.4634703Z CONTINUE_THROUGH_ERROR: True 2025-12-04T09:29:12.4635067Z VERBOSE_TEST_LOGS: False 2025-12-04T09:29:12.4635407Z TEST_SHOWLOCALS: False 2025-12-04T09:29:12.4635745Z NO_TEST_TIMEOUT: False 2025-12-04T09:29:12.4636106Z NO_TD: False 2025-12-04T09:29:12.4636954Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:29:12.4637912Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2025-12-04T09:29:12.4638324Z PYTORCH_TEST_RERUN_DISABLED_TESTS: 1 2025-12-04T09:29:12.4638707Z TESTS_TO_INCLUDE: 2025-12-04T09:29:12.4639017Z DASHBOARD_TAG: 2025-12-04T09:29:12.4639477Z HUGGING_FACE_HUB_TOKEN: *** 2025-12-04T09:29:12.4639827Z ##[endgroup] 2025-12-04T09:29:12.4674819Z + [[ default == \m\u\l\t\i\g\p\u ]] 2025-12-04T09:29:12.4675256Z + [[ linux-jammy-rocm-py3.10 == *onnx* ]] 2025-12-04T09:29:12.4675686Z + TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:29:12.4680652Z +++ nproc --ignore=2 2025-12-04T09:29:12.4692476Z ++ docker run --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e 
CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e TEST_SHOWLOCALS -e NO_TEST_TIMEOUT -e NO_TD -e MAX_JOBS=126 -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e TESTS_TO_INCLUDE -e HUGGING_FACE_HUB_TOKEN -e DASHBOARD_TAG --env-file=/home/runner/_work/_temp/github_env_19922849170 --ulimit stack=10485760:83886080 --ulimit core=0 --env-file=/tmp/github_env_19922849170 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --shm-size=8g --tty --detach --name= --user jenkins -v /home/runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:29:12.7057016Z + container_name=0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T09:29:12.7057400Z + echo CONTAINER_NAME=0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T09:29:12.7057860Z + docker exec -t 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 sh -c 'cd .. 
&& cp -R workspace pytorch && cd pytorch && pip install dist/*.whl && .ci/pytorch/test.sh' 2025-12-04T09:29:15.9659181Z Processing ./dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:29:16.5105287Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (3.18.0) 2025-12-04T09:29:16.5138029Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (4.12.2) 2025-12-04T09:29:16.5138456Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (1.13.3) 2025-12-04T09:29:16.5138894Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (2.8.8) 2025-12-04T09:29:16.5139279Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (3.1.6) 2025-12-04T09:29:16.5139679Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f) (2025.10.0) 2025-12-04T09:29:16.5272774Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch==2.10.0a0+gitffd9b0f) (1.3.0) 2025-12-04T09:29:16.5296374Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.10.0a0+gitffd9b0f) (3.0.3) 2025-12-04T09:29:16.7253105Z Installing collected packages: torch 2025-12-04T09:29:22.3239153Z Successfully installed torch-2.10.0a0+gitffd9b0f 2025-12-04T09:29:22.3653889Z + export TERM=vt100 2025-12-04T09:29:22.3664973Z + TERM=vt100 2025-12-04T09:29:22.3665117Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:29:22.3665246Z + source .ci/pytorch/common.sh 2025-12-04T09:29:22.3665369Z +++ 
dirname .ci/pytorch/common.sh 2025-12-04T09:29:22.3687550Z ++ source .ci/pytorch/common_utils.sh 2025-12-04T09:29:22.3688080Z +++ declare -f -t trap_add 2025-12-04T09:29:22.3726253Z ++ set -ex -o pipefail 2025-12-04T09:29:22.3726402Z ++ [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T09:29:22.3726597Z ++ unset HIP_PLATFORM 2025-12-04T09:29:22.3726734Z ++ export PYTORCH_TEST_WITH_ROCM=1 2025-12-04T09:29:22.3726858Z ++ PYTORCH_TEST_WITH_ROCM=1 2025-12-04T09:29:22.3726972Z ++ BUILD_TEST_LIBTORCH=0 2025-12-04T09:29:22.3727084Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:29:22.3727249Z + source .ci/pytorch/common-build.sh 2025-12-04T09:29:22.3727460Z ++ [[ linux-jammy-rocm-py3.10 != *win-* ]] 2025-12-04T09:29:22.3732823Z ++++ dirname .ci/pytorch/common-build.sh 2025-12-04T09:29:22.3748608Z +++ cd .ci/pytorch 2025-12-04T09:29:22.3749517Z +++ pwd -P 2025-12-04T09:29:22.3750260Z ++ script_dir=/var/lib/jenkins/pytorch/.ci/pytorch 2025-12-04T09:29:22.3750485Z ++ [[ linux-jammy-rocm-py3.10 == *-pch* ]] 2025-12-04T09:29:22.3750623Z ++ which sccache 2025-12-04T09:29:22.3750730Z ++ [[ -z '' ]] 2025-12-04T09:29:22.3750855Z ++ unset SCCACHE_BUCKET 2025-12-04T09:29:22.3750961Z ++ unset SCCACHE_REGION 2025-12-04T09:29:22.3751079Z ++ sccache --stop-server 2025-12-04T09:29:22.3762777Z ++ true 2025-12-04T09:29:22.3765454Z ++ rm -f /var/lib/jenkins/sccache_error.log 2025-12-04T09:29:22.3807540Z ++ trap_add sccache_epilogue EXIT 2025-12-04T09:29:22.3836148Z ++ trap_add_cmd=sccache_epilogue 2025-12-04T09:29:22.3836276Z ++ shift 2025-12-04T09:29:22.3838885Z ++ for trap_add_name in "$@" 2025-12-04T09:29:22.3839068Z ++++ trap -p EXIT 2025-12-04T09:29:22.3850118Z +++ eval 'extract_trap_cmd ' 2025-12-04T09:29:22.3850282Z ++++ extract_trap_cmd 2025-12-04T09:29:22.3850390Z ++++ printf '%s\n' '' 2025-12-04T09:29:22.3850502Z +++ printf '%s\n' sccache_epilogue 2025-12-04T09:29:22.3850625Z ++ trap -- ' 2025-12-04T09:29:22.3850762Z sccache_epilogue' EXIT 2025-12-04T09:29:22.3850866Z ++ [[ -n '' 
]] 2025-12-04T09:29:22.3850977Z ++ [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T09:29:22.3851135Z ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:29:22.3851798Z ++ SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:29:22.3851910Z ++ sccache --start-server 2025-12-04T09:29:22.3852025Z sccache: Starting the server... 2025-12-04T09:29:22.4987642Z sccache: Listening on address 127.0.0.1:4226 2025-12-04T09:29:22.5030572Z ++ sccache --zero-stats 2025-12-04T09:29:22.5033742Z Statistics zeroed. 2025-12-04T09:29:22.5036998Z ++ which ccache 2025-12-04T09:29:22.5043765Z + [[ linux-jammy-rocm-py3.10 != *rocm* ]] 2025-12-04T09:29:22.5044496Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T09:29:22.5044701Z + echo 'Environment variables:' 2025-12-04T09:29:22.5044833Z Environment variables: 2025-12-04T09:29:22.5044978Z + env 2025-12-04T09:29:22.5051958Z GITHUB_WORKSPACE=/home/runner/_work/pytorch/pytorch 2025-12-04T09:29:22.5052125Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:29:22.5052284Z BUILD_ENVIRONMENT=linux-jammy-rocm-py3.10 2025-12-04T09:29:22.5052454Z HOSTNAME=linux.rocm.gpu.gfx942.1.b-gwk9b-runner-vcbrh 2025-12-04T09:29:22.5052716Z GITHUB_PATH=/home/runner/_work/_temp/_runner_file_commands/add_path_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5052967Z GITHUB_ACTION=__run_2 2025-12-04T09:29:22.5053084Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:29:22.5053210Z GITHUB_RUN_NUMBER=689 2025-12-04T09:29:22.5053314Z TEST_CONFIG=default 2025-12-04T09:29:22.5053450Z RUNNER_NAME=linux.rocm.gpu.gfx942.1.b-gwk9b-runner-vcbrh 2025-12-04T09:29:22.5053610Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:29:22.5053747Z AWS_DEFAULT_REGION=us-east-1 2025-12-04T09:29:22.5053889Z RUNNER_ARTIFACT_DIR=/home/runner/_work/_temp/artifacts 2025-12-04T09:29:22.5054044Z GITHUB_TRIGGERING_ACTOR=pytorchmergebot 2025-12-04T09:29:22.5054172Z GITHUB_REF_TYPE=branch 2025-12-04T09:29:22.5054295Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 
2025-12-04T09:29:22.5054552Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:29:22.5055034Z *** 2025-12-04T09:29:22.5055133Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:29:22.5055249Z GITHUB_ACTIONS=true 2025-12-04T09:29:22.5055367Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5055527Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5055750Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/trunk-rocm-mi300.yml@refs/heads/main 2025-12-04T09:29:22.5055994Z UCC_HOME=/usr 2025-12-04T09:29:22.5056104Z RUNNER_ENVIRONMENT=self-hosted 2025-12-04T09:29:22.5056229Z VERBOSE_TEST_LOGS=False 2025-12-04T09:29:22.5056341Z GITHUB_REF=refs/heads/main 2025-12-04T09:29:22.5056448Z RUNNER_OS=Linux 2025-12-04T09:29:22.5056549Z SHARD_NUMBER=6 2025-12-04T09:29:22.5056655Z GITHUB_REF_PROTECTED=true 2025-12-04T09:29:22.5056781Z RUNNER_MANUALLY_TRAP_SIG=1 2025-12-04T09:29:22.5056895Z HOME=/var/lib/jenkins 2025-12-04T09:29:22.5057029Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:29:22.5057426Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:29:22.5057563Z RUNNER_DOCS_DIR=/home/runner/_work/_temp/docs 2025-12-04T09:29:22.5057695Z LANG=C.UTF-8 2025-12-04T09:29:22.5057818Z UCX_COMMIT=29831d319e6be55cb8c768ca61de335c934ca39e 2025-12-04T09:29:22.5057966Z PYTORCH_TEST_WITH_ROCM=1 2025-12-04T09:29:22.5058122Z RUNNER_TRACKING_ID=github_2e0a6e07-598a-439d-a3fe-d7ef03ba3b52 2025-12-04T09:29:22.5058281Z RUNNER_ARCH=X64 2025-12-04T09:29:22.5058392Z RUNNER_TEMP=/home/runner/_work/_temp 2025-12-04T09:29:22.5058526Z NUM_TEST_SHARDS=6 2025-12-04T09:29:22.5058621Z UCX_HOME=/usr 2025-12-04T09:29:22.5058816Z GITHUB_STATE=/home/runner/_work/_temp/_runner_file_commands/save_state_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5059132Z JOB_NAME=linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:29:22.5059355Z MAGMA_HOME=/opt/rocm/magma 2025-12-04T09:29:22.5059553Z 
GITHUB_ENV=/home/runner/_work/_temp/_runner_file_commands/set_env_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5059797Z GITHUB_EVENT_PATH=/home/runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:29:22.5059968Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:29:22.5060141Z GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT=actions-runner-controller/0.12.1 2025-12-04T09:29:22.5060432Z DASHBOARD_TAG= 2025-12-04T09:29:22.5060528Z GITHUB_RUN_ID=19922849170 2025-12-04T09:29:22.5060741Z GITHUB_STEP_SUMMARY=/home/runner/_work/_temp/_runner_file_commands/step_summary_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5060965Z GITHUB_ACTOR=pytorchmergebot 2025-12-04T09:29:22.5061076Z PR_NUMBER= 2025-12-04T09:29:22.5061170Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:29:22.5061277Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:29:22.5061414Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:29:22.5061546Z TERM=vt100 2025-12-04T09:29:22.5061636Z INSTALLED_VISION=yes 2025-12-04T09:29:22.5061737Z BRANCH=main 2025-12-04T09:29:22.5061833Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:29:22.5061950Z TESTS_TO_INCLUDE= 2025-12-04T09:29:22.5062111Z GITHUB_ACTION_PATH=/home/runner/_work/pytorch/pytorch/./.github/actions/setup-rocm 2025-12-04T09:29:22.5062302Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:29:22.5062445Z PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100 2025-12-04T09:29:22.5062595Z UCC_COMMIT=9f4b242cbbd8b1462cbc732eb29316cdfa124b77 2025-12-04T09:29:22.5062728Z REENABLED_ISSUES= 2025-12-04T09:29:22.5062826Z SHLVL=1 2025-12-04T09:29:22.5062915Z MAX_JOBS=126 2025-12-04T09:29:22.5063043Z RUNNER_TEST_RESULTS_DIR=/home/runner/_work/_temp/test-results 2025-12-04T09:29:22.5063194Z GITHUB_ACTOR_ID=97764156 2025-12-04T09:29:22.5063316Z RUNNER_TOOL_CACHE=/home/runner/_work/_tool 2025-12-04T09:29:22.5063485Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5063644Z GITHUB_REF_NAME=main 2025-12-04T09:29:22.5063754Z ROCM_PATH=/opt/rocm 
2025-12-04T09:29:22.5063857Z GITHUB_JOB=test 2025-12-04T09:29:22.5063952Z NO_TEST_TIMEOUT=False 2025-12-04T09:29:22.5064064Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:29:22.5064181Z LC_ALL=C.UTF-8 2025-12-04T09:29:22.5064278Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:29:22.5064395Z RUNNER_WORKSPACE=/home/runner/_work/pytorch 2025-12-04T09:29:22.5064529Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:29:22.5064639Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:29:22.5064997Z PATH=/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:29:22.5065345Z GITHUB_BASE_REF= 2025-12-04T09:29:22.5065442Z CI=true 2025-12-04T09:29:22.5065553Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:29:22.5065675Z JOB_ID=57116213184 2025-12-04T09:29:22.5065773Z GITHUB_HEAD_REF= 2025-12-04T09:29:22.5065876Z GITHUB_ACTION_REF= 2025-12-04T09:29:22.5066035Z TEST_SHOWLOCALS=False 2025-12-04T09:29:22.5066150Z GITHUB_WORKFLOW=trunk-rocm-mi300 2025-12-04T09:29:22.5066273Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:29:22.5066530Z GITHUB_OUTPUT=/home/runner/_work/_temp/_runner_file_commands/set_output_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5066740Z NO_TD=False 2025-12-04T09:29:22.5066835Z OLDPWD=/var/lib/jenkins 2025-12-04T09:29:22.5066941Z _=/usr/bin/env 2025-12-04T09:29:22.5067076Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2025-12-04T09:29:22.5117025Z + TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2025-12-04T09:29:22.5117264Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:29:22.5117483Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2025-12-04T09:29:22.5117702Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2025-12-04T09:29:22.5117869Z + BUILD_DIR=build 
2025-12-04T09:29:22.5117970Z + BUILD_RENAMED_DIR=build_renamed 2025-12-04T09:29:22.5118088Z + BUILD_BIN_DIR=build/bin 2025-12-04T09:29:22.5118192Z + SHARD_NUMBER=6 2025-12-04T09:29:22.5118290Z + NUM_TEST_SHARDS=6 2025-12-04T09:29:22.5118396Z + export TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:29:22.5118529Z + TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:29:22.5118646Z + export VALGRIND=ON 2025-12-04T09:29:22.5118821Z + VALGRIND=ON 2025-12-04T09:29:22.5118929Z + [[ linux-jammy-rocm-py3.10 == *clang9* ]] 2025-12-04T09:29:22.5119069Z + [[ linux-jammy-rocm-py3.10 == *xpu* ]] 2025-12-04T09:29:22.5119190Z + detect_cuda_arch 2025-12-04T09:29:22.5119295Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T09:29:22.5119427Z + [[ linux-jammy-rocm-py3.10 == *s390x* ]] 2025-12-04T09:29:22.5119542Z + [[ 1 == \1 ]] 2025-12-04T09:29:22.5119637Z + ulimit -c 0 2025-12-04T09:29:22.5119737Z + [[ linux-jammy-rocm-py3.10 != *bazel* ]] 2025-12-04T09:29:22.5124016Z ++ realpath build/custom_test_artifacts 2025-12-04T09:29:22.5132360Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/pytorch/build/custom_test_artifacts 2025-12-04T09:29:22.5132876Z + [[ -n '' ]] 2025-12-04T09:29:22.5133293Z + echo 'Environment variables' 2025-12-04T09:29:22.5133485Z Environment variables 2025-12-04T09:29:22.5133599Z + env 2025-12-04T09:29:22.5142937Z GITHUB_WORKSPACE=/home/runner/_work/pytorch/pytorch 2025-12-04T09:29:22.5143177Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:29:22.5143337Z BUILD_ENVIRONMENT=linux-jammy-rocm-py3.10 2025-12-04T09:29:22.5143509Z HOSTNAME=linux.rocm.gpu.gfx942.1.b-gwk9b-runner-vcbrh 2025-12-04T09:29:22.5143761Z GITHUB_PATH=/home/runner/_work/_temp/_runner_file_commands/add_path_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5143979Z GITHUB_ACTION=__run_2 2025-12-04T09:29:22.5144104Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:29:22.5144227Z GITHUB_RUN_NUMBER=689 2025-12-04T09:29:22.5144340Z TEST_CONFIG=default 2025-12-04T09:29:22.5144482Z 
RUNNER_NAME=linux.rocm.gpu.gfx942.1.b-gwk9b-runner-vcbrh 2025-12-04T09:29:22.5144646Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:29:22.5144785Z AWS_DEFAULT_REGION=us-east-1 2025-12-04T09:29:22.5144932Z RUNNER_ARTIFACT_DIR=/home/runner/_work/_temp/artifacts 2025-12-04T09:29:22.5145096Z GITHUB_TRIGGERING_ACTOR=pytorchmergebot 2025-12-04T09:29:22.5145224Z GITHUB_REF_TYPE=branch 2025-12-04T09:29:22.5145353Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5145676Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:29:22.5145867Z *** 2025-12-04T09:29:22.5146016Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:29:22.5146133Z GITHUB_ACTIONS=true 2025-12-04T09:29:22.5146245Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5146406Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5146633Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/trunk-rocm-mi300.yml@refs/heads/main 2025-12-04T09:29:22.5146832Z UCC_HOME=/usr 2025-12-04T09:29:22.5146938Z TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:29:22.5147061Z RUNNER_ENVIRONMENT=self-hosted 2025-12-04T09:29:22.5147180Z VERBOSE_TEST_LOGS=False 2025-12-04T09:29:22.5147290Z GITHUB_REF=refs/heads/main 2025-12-04T09:29:22.5147398Z RUNNER_OS=Linux 2025-12-04T09:29:22.5147496Z SHARD_NUMBER=6 2025-12-04T09:29:22.5147825Z GITHUB_REF_PROTECTED=true 2025-12-04T09:29:22.5147943Z RUNNER_MANUALLY_TRAP_SIG=1 2025-12-04T09:29:22.5148054Z HOME=/var/lib/jenkins 2025-12-04T09:29:22.5148179Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:29:22.5148323Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:29:22.5148461Z RUNNER_DOCS_DIR=/home/runner/_work/_temp/docs 2025-12-04T09:29:22.5148592Z LANG=C.UTF-8 2025-12-04T09:29:22.5148710Z UCX_COMMIT=29831d319e6be55cb8c768ca61de335c934ca39e 2025-12-04T09:29:22.5148855Z PYTORCH_TEST_WITH_ROCM=1 2025-12-04T09:29:22.5149010Z RUNNER_TRACKING_ID=github_2e0a6e07-598a-439d-a3fe-d7ef03ba3b52 2025-12-04T09:29:22.5149165Z RUNNER_ARCH=X64 
2025-12-04T09:29:22.5149276Z RUNNER_TEMP=/home/runner/_work/_temp 2025-12-04T09:29:22.5149403Z NUM_TEST_SHARDS=6 2025-12-04T09:29:22.5149502Z UCX_HOME=/usr 2025-12-04T09:29:22.5149692Z GITHUB_STATE=/home/runner/_work/_temp/_runner_file_commands/save_state_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5150005Z JOB_NAME=linux-jammy-rocm-py3.10 / test (default, 6, 6, linux.rocm.gpu.gfx942.1.b, rerun_disabled_tests, unstable) 2025-12-04T09:29:22.5150229Z MAGMA_HOME=/opt/rocm/magma 2025-12-04T09:29:22.5150427Z GITHUB_ENV=/home/runner/_work/_temp/_runner_file_commands/set_env_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5150803Z GITHUB_EVENT_PATH=/home/runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:29:22.5150972Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:29:22.5151136Z GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT=actions-runner-controller/0.12.1 2025-12-04T09:29:22.5151301Z DASHBOARD_TAG= 2025-12-04T09:29:22.5151403Z GITHUB_RUN_ID=19922849170 2025-12-04T09:29:22.5151622Z GITHUB_STEP_SUMMARY=/home/runner/_work/_temp/_runner_file_commands/step_summary_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5151854Z GITHUB_ACTOR=pytorchmergebot 2025-12-04T09:29:22.5151972Z PR_NUMBER= 2025-12-04T09:29:22.5152068Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:29:22.5152174Z VALGRIND=ON 2025-12-04T09:29:22.5152277Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:29:22.5152421Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:29:22.5152560Z TERM=vt100 2025-12-04T09:29:22.5152655Z INSTALLED_VISION=yes 2025-12-04T09:29:22.5152758Z BRANCH=main 2025-12-04T09:29:22.5152889Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:29:22.5153005Z TESTS_TO_INCLUDE= 2025-12-04T09:29:22.5153172Z GITHUB_ACTION_PATH=/home/runner/_work/pytorch/pytorch/./.github/actions/setup-rocm 2025-12-04T09:29:22.5153371Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:29:22.5153517Z PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100 2025-12-04T09:29:22.5153676Z 
UCC_COMMIT=9f4b242cbbd8b1462cbc732eb29316cdfa124b77 2025-12-04T09:29:22.5153818Z REENABLED_ISSUES= 2025-12-04T09:29:22.5153918Z SHLVL=1 2025-12-04T09:29:22.5154011Z MAX_JOBS=126 2025-12-04T09:29:22.5154144Z RUNNER_TEST_RESULTS_DIR=/home/runner/_work/_temp/test-results 2025-12-04T09:29:22.5154302Z GITHUB_ACTOR_ID=97764156 2025-12-04T09:29:22.5154425Z RUNNER_TOOL_CACHE=/home/runner/_work/_tool 2025-12-04T09:29:22.5154584Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:29:22.5154738Z GITHUB_REF_NAME=main 2025-12-04T09:29:22.5154843Z ROCM_PATH=/opt/rocm 2025-12-04T09:29:22.5154948Z GITHUB_JOB=test 2025-12-04T09:29:22.5155049Z NO_TEST_TIMEOUT=False 2025-12-04T09:29:22.5155165Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:29:22.5155286Z LC_ALL=C.UTF-8 2025-12-04T09:29:22.5155383Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:29:22.5155505Z RUNNER_WORKSPACE=/home/runner/_work/pytorch 2025-12-04T09:29:22.5155639Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:29:22.5155753Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:29:22.5156154Z PATH=/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:29:22.5156508Z GITHUB_BASE_REF= 2025-12-04T09:29:22.5156606Z CI=true 2025-12-04T09:29:22.5156704Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:29:22.5156894Z JOB_ID=57116213184 2025-12-04T09:29:22.5156993Z GITHUB_HEAD_REF= 2025-12-04T09:29:22.5157092Z GITHUB_ACTION_REF= 2025-12-04T09:29:22.5157191Z TEST_SHOWLOCALS=False 2025-12-04T09:29:22.5157310Z GITHUB_WORKFLOW=trunk-rocm-mi300 2025-12-04T09:29:22.5157435Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:29:22.5157639Z GITHUB_OUTPUT=/home/runner/_work/_temp/_runner_file_commands/set_output_b9bfcac5-744a-4881-8bf8-670b1e229ae8 2025-12-04T09:29:22.5157844Z NO_TD=False 2025-12-04T09:29:22.5157932Z OLDPWD=/var/lib/jenkins 2025-12-04T09:29:22.5158033Z 
_=/usr/bin/env 2025-12-04T09:29:22.5158126Z + echo 'Testing pytorch' 2025-12-04T09:29:22.5158231Z Testing pytorch 2025-12-04T09:29:22.5158328Z + export LANG=C.UTF-8 2025-12-04T09:29:22.5158425Z + LANG=C.UTF-8 2025-12-04T09:29:22.5158514Z + PR_NUMBER= 2025-12-04T09:29:22.5158611Z + [[ default == \d\e\f\a\u\l\t ]] 2025-12-04T09:29:22.5158733Z + export CUDA_VISIBLE_DEVICES=0 2025-12-04T09:29:22.5158847Z + CUDA_VISIBLE_DEVICES=0 2025-12-04T09:29:22.5158956Z + export HIP_VISIBLE_DEVICES=0 2025-12-04T09:29:22.5159072Z + HIP_VISIBLE_DEVICES=0 2025-12-04T09:29:22.5159183Z + [[ default == \d\i\s\t\r\i\b\u\t\e\d ]] 2025-12-04T09:29:22.5159307Z + [[ default == \s\l\o\w ]] 2025-12-04T09:29:22.5159540Z + [[ linux-jammy-rocm-py3.10 == *slow-gradcheck* ]] 2025-12-04T09:29:22.5159687Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T09:29:22.5159818Z + [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T09:29:22.5159953Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:29:22.5160088Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:29:22.5160210Z + [[ default == *crossref* ]] 2025-12-04T09:29:22.5160328Z + [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T09:29:22.5160448Z + export VALGRIND=OFF 2025-12-04T09:29:22.5160547Z + VALGRIND=OFF 2025-12-04T09:29:22.5160635Z + rocminfo 2025-12-04T09:29:22.5264418Z ROCk module version 6.12.12 is loaded 2025-12-04T09:29:22.5645137Z ===================== 2025-12-04T09:29:22.5645527Z HSA System Attributes 2025-12-04T09:29:22.5645826Z ===================== 2025-12-04T09:29:22.5646211Z Runtime Version: 1.18 2025-12-04T09:29:22.5646528Z Runtime Ext Version: 1.14 2025-12-04T09:29:22.5646871Z System Timestamp Freq.: 1000.000000MHz 2025-12-04T09:29:22.5647394Z Sig. 
Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count) 2025-12-04T09:29:22.5647956Z Machine Model: LARGE 2025-12-04T09:29:22.5648421Z System Endianness: LITTLE 2025-12-04T09:29:22.5648822Z Mwaitx: DISABLED 2025-12-04T09:29:22.5649203Z XNACK enabled: NO 2025-12-04T09:29:22.5649508Z DMAbuf Support: YES 2025-12-04T09:29:22.5649856Z VMM Support: YES 2025-12-04T09:29:22.5650055Z 2025-12-04T09:29:22.5650167Z ========== 2025-12-04T09:29:22.5650447Z HSA Agents 2025-12-04T09:29:22.5650720Z ========== 2025-12-04T09:29:22.5650984Z ******* 2025-12-04T09:29:22.5651253Z Agent 1 2025-12-04T09:29:22.5651569Z ******* 2025-12-04T09:29:22.5651901Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T09:29:22.5652276Z Uuid: CPU-XX 2025-12-04T09:29:22.5652680Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T09:29:22.5653029Z Vendor Name: CPU 2025-12-04T09:29:22.5653344Z Feature: None specified 2025-12-04T09:29:22.5653657Z Profile: FULL_PROFILE 2025-12-04T09:29:22.5653975Z Float Round Mode: NEAR 2025-12-04T09:29:22.5654302Z Max Queue Number: 0(0x0) 2025-12-04T09:29:22.5654626Z Queue Min Size: 0(0x0) 2025-12-04T09:29:22.5654941Z Queue Max Size: 0(0x0) 2025-12-04T09:29:22.5655380Z Queue Type: MULTI 2025-12-04T09:29:22.5655678Z Node: 0 2025-12-04T09:29:22.5656031Z Device Type: CPU 2025-12-04T09:29:22.5656310Z Cache Info: 2025-12-04T09:29:22.5656560Z L1: 49152(0xc000) KB 2025-12-04T09:29:22.5656848Z Chip ID: 0(0x0) 2025-12-04T09:29:22.5657149Z ASIC Revision: 0(0x0) 2025-12-04T09:29:22.5657467Z Cacheline Size: 64(0x40) 2025-12-04T09:29:22.5657786Z Max Clock Freq. (MHz): 3300 2025-12-04T09:29:22.5658089Z BDFID: 0 2025-12-04T09:29:22.5658394Z Internal Node ID: 0 2025-12-04T09:29:22.5658714Z Compute Unit: 64 2025-12-04T09:29:22.5659025Z SIMDs per CU: 0 2025-12-04T09:29:22.5659338Z Shader Engines: 0 2025-12-04T09:29:22.5659744Z Shader Arrs. per Eng.: 0 2025-12-04T09:29:22.5660078Z WatchPts on Addr. 
Ranges:1 2025-12-04T09:29:22.5660370Z Memory Properties: 2025-12-04T09:29:22.5660600Z Features: None 2025-12-04T09:29:22.5660827Z Pool Info: 2025-12-04T09:29:22.5661046Z Pool 1 2025-12-04T09:29:22.5661413Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T09:29:22.5661730Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T09:29:22.5662042Z Allocatable: TRUE 2025-12-04T09:29:22.5662365Z Alloc Granule: 4KB 2025-12-04T09:29:22.5662712Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5663054Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5663391Z Accessible by all: TRUE 2025-12-04T09:29:22.5663679Z Pool 2 2025-12-04T09:29:22.5663952Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T09:29:22.5664262Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T09:29:22.5664565Z Allocatable: TRUE 2025-12-04T09:29:22.5664889Z Alloc Granule: 4KB 2025-12-04T09:29:22.5665223Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5665561Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5665892Z Accessible by all: TRUE 2025-12-04T09:29:22.5666220Z Pool 3 2025-12-04T09:29:22.5666488Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T09:29:22.5666794Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T09:29:22.5667109Z Allocatable: TRUE 2025-12-04T09:29:22.5667433Z Alloc Granule: 4KB 2025-12-04T09:29:22.5667771Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5668113Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5668446Z Accessible by all: TRUE 2025-12-04T09:29:22.5668733Z Pool 4 2025-12-04T09:29:22.5669001Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T09:29:22.5669310Z Size: 1584777168(0x5e75c7d0) KB 2025-12-04T09:29:22.5669713Z Allocatable: TRUE 2025-12-04T09:29:22.5670038Z Alloc Granule: 4KB 2025-12-04T09:29:22.5670376Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5670716Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5671050Z Accessible by all: TRUE 2025-12-04T09:29:22.5671334Z ISA Info: 2025-12-04T09:29:22.5671548Z ******* 2025-12-04T09:29:22.5671760Z Agent 2 2025-12-04T09:29:22.5671967Z ******* 
2025-12-04T09:29:22.5672218Z Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T09:29:22.5672527Z Uuid: CPU-XX 2025-12-04T09:29:22.5672851Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T09:29:22.5673192Z Vendor Name: CPU 2025-12-04T09:29:22.5673518Z Feature: None specified 2025-12-04T09:29:22.5673836Z Profile: FULL_PROFILE 2025-12-04T09:29:22.5674161Z Float Round Mode: NEAR 2025-12-04T09:29:22.5674543Z Max Queue Number: 0(0x0) 2025-12-04T09:29:22.5674859Z Queue Min Size: 0(0x0) 2025-12-04T09:29:22.5675172Z Queue Max Size: 0(0x0) 2025-12-04T09:29:22.5675485Z Queue Type: MULTI 2025-12-04T09:29:22.5675785Z Node: 1 2025-12-04T09:29:22.5676121Z Device Type: CPU 2025-12-04T09:29:22.5676403Z Cache Info: 2025-12-04T09:29:22.5676649Z L1: 49152(0xc000) KB 2025-12-04T09:29:22.5676946Z Chip ID: 0(0x0) 2025-12-04T09:29:22.5677254Z ASIC Revision: 0(0x0) 2025-12-04T09:29:22.5677581Z Cacheline Size: 64(0x40) 2025-12-04T09:29:22.5677912Z Max Clock Freq. (MHz): 3300 2025-12-04T09:29:22.5678215Z BDFID: 0 2025-12-04T09:29:22.5678521Z Internal Node ID: 1 2025-12-04T09:29:22.5678840Z Compute Unit: 64 2025-12-04T09:29:22.5679155Z SIMDs per CU: 0 2025-12-04T09:29:22.5679473Z Shader Engines: 0 2025-12-04T09:29:22.5679798Z Shader Arrs. per Eng.: 0 2025-12-04T09:29:22.5680131Z WatchPts on Addr. 
Ranges:1 2025-12-04T09:29:22.5680430Z Memory Properties: 2025-12-04T09:29:22.5680665Z Features: None 2025-12-04T09:29:22.5680894Z Pool Info: 2025-12-04T09:29:22.5681108Z Pool 1 2025-12-04T09:29:22.5681383Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T09:29:22.5681708Z Size: 1585311812(0x5e7df044) KB 2025-12-04T09:29:22.5682020Z Allocatable: TRUE 2025-12-04T09:29:22.5682348Z Alloc Granule: 4KB 2025-12-04T09:29:22.5682690Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5683037Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5683369Z Accessible by all: TRUE 2025-12-04T09:29:22.5683659Z Pool 2 2025-12-04T09:29:22.5683930Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T09:29:22.5684310Z Size: 1585311812(0x5e7df044) KB 2025-12-04T09:29:22.5684620Z Allocatable: TRUE 2025-12-04T09:29:22.5684941Z Alloc Granule: 4KB 2025-12-04T09:29:22.5685282Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5685619Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5685981Z Accessible by all: TRUE 2025-12-04T09:29:22.5686269Z Pool 3 2025-12-04T09:29:22.5686540Z Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED 2025-12-04T09:29:22.5686852Z Size: 1585311812(0x5e7df044) KB 2025-12-04T09:29:22.5687159Z Allocatable: TRUE 2025-12-04T09:29:22.5687483Z Alloc Granule: 4KB 2025-12-04T09:29:22.5687827Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5688167Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5688500Z Accessible by all: TRUE 2025-12-04T09:29:22.5688848Z Pool 4 2025-12-04T09:29:22.5689116Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T09:29:22.5689424Z Size: 1585311812(0x5e7df044) KB 2025-12-04T09:29:22.5689729Z Allocatable: TRUE 2025-12-04T09:29:22.5690050Z Alloc Granule: 4KB 2025-12-04T09:29:22.5690391Z Alloc Recommended Granule:4KB 2025-12-04T09:29:22.5690731Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5691064Z Accessible by all: TRUE 2025-12-04T09:29:22.5691355Z ISA Info: 2025-12-04T09:29:22.5691567Z ******* 2025-12-04T09:29:22.5691778Z Agent 3 2025-12-04T09:29:22.5691975Z ******* 
2025-12-04T09:29:22.5692219Z Name: gfx942 2025-12-04T09:29:22.5692521Z Uuid: GPU-01ff8763ec76c341 2025-12-04T09:29:22.5692836Z Marketing Name: 2025-12-04T09:29:22.5693155Z Vendor Name: AMD 2025-12-04T09:29:22.5693472Z Feature: KERNEL_DISPATCH 2025-12-04T09:29:22.5693787Z Profile: BASE_PROFILE 2025-12-04T09:29:22.5694109Z Float Round Mode: NEAR 2025-12-04T09:29:22.5694432Z Max Queue Number: 128(0x80) 2025-12-04T09:29:22.5694755Z Queue Min Size: 64(0x40) 2025-12-04T09:29:22.5695069Z Queue Max Size: 131072(0x20000) 2025-12-04T09:29:22.5695382Z Queue Type: MULTI 2025-12-04T09:29:22.5695680Z Node: 2 2025-12-04T09:29:22.5696030Z Device Type: GPU 2025-12-04T09:29:22.5696312Z Cache Info: 2025-12-04T09:29:22.5696558Z L1: 32(0x20) KB 2025-12-04T09:29:22.5696838Z L2: 4096(0x1000) KB 2025-12-04T09:29:22.5697114Z L3: 262144(0x40000) KB 2025-12-04T09:29:22.5697399Z Chip ID: 29861(0x74a5) 2025-12-04T09:29:22.5697710Z ASIC Revision: 1(0x1) 2025-12-04T09:29:22.5698033Z Cacheline Size: 128(0x80) 2025-12-04T09:29:22.5698424Z Max Clock Freq. (MHz): 2100 2025-12-04T09:29:22.5698730Z BDFID: 25856 2025-12-04T09:29:22.5699042Z Internal Node ID: 2 2025-12-04T09:29:22.5699366Z Compute Unit: 304 2025-12-04T09:29:22.5699676Z SIMDs per CU: 4 2025-12-04T09:29:22.5699991Z Shader Engines: 32 2025-12-04T09:29:22.5700318Z Shader Arrs. per Eng.: 1 2025-12-04T09:29:22.5700653Z WatchPts on Addr. 
Ranges:4 2025-12-04T09:29:22.5700990Z Coherent Host Access: FALSE 2025-12-04T09:29:22.5701288Z Memory Properties: 2025-12-04T09:29:22.5701540Z Features: KERNEL_DISPATCH 2025-12-04T09:29:22.5701842Z Fast F16 Operation: TRUE 2025-12-04T09:29:22.5702181Z Wavefront Size: 64(0x40) 2025-12-04T09:29:22.5702508Z Workgroup Max Size: 1024(0x400) 2025-12-04T09:29:22.5702811Z Workgroup Max Size per Dimension: 2025-12-04T09:29:22.5703138Z x 1024(0x400) 2025-12-04T09:29:22.5703415Z y 1024(0x400) 2025-12-04T09:29:22.5703682Z z 1024(0x400) 2025-12-04T09:29:22.5703978Z Max Waves Per CU: 32(0x20) 2025-12-04T09:29:22.5704305Z Max Work-item Per CU: 2048(0x800) 2025-12-04T09:29:22.5704625Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T09:29:22.5704911Z Grid Max Size per Dimension: 2025-12-04T09:29:22.5705157Z x 2147483647(0x7fffffff) 2025-12-04T09:29:22.5705431Z y 65535(0xffff) 2025-12-04T09:29:22.5705699Z z 65535(0xffff) 2025-12-04T09:29:22.5706048Z Max fbarriers/Workgrp: 32 2025-12-04T09:29:22.5706467Z Packet Processor uCode:: 185 2025-12-04T09:29:22.5706808Z SDMA engine uCode:: 24 2025-12-04T09:29:22.5707136Z IOMMU Support:: None 2025-12-04T09:29:22.5707424Z Pool Info: 2025-12-04T09:29:22.5707645Z Pool 1 2025-12-04T09:29:22.5707922Z Segment: GLOBAL; FLAGS: COARSE GRAINED 2025-12-04T09:29:22.5708243Z Size: 268419072(0xfffc000) KB 2025-12-04T09:29:22.5708555Z Allocatable: TRUE 2025-12-04T09:29:22.5708892Z Alloc Granule: 4KB 2025-12-04T09:29:22.5709236Z Alloc Recommended Granule:2048KB 2025-12-04T09:29:22.5709579Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5709917Z Accessible by all: FALSE 2025-12-04T09:29:22.5710212Z Pool 2 2025-12-04T09:29:22.5710486Z Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED 2025-12-04T09:29:22.5710798Z Size: 268419072(0xfffc000) KB 2025-12-04T09:29:22.5711102Z Allocatable: TRUE 2025-12-04T09:29:22.5711425Z Alloc Granule: 4KB 2025-12-04T09:29:22.5711763Z Alloc Recommended Granule:2048KB 2025-12-04T09:29:22.5712102Z Alloc Alignment: 4KB 
2025-12-04T09:29:22.5712432Z Accessible by all: FALSE 2025-12-04T09:29:22.5712786Z Pool 3 2025-12-04T09:29:22.5713052Z Segment: GLOBAL; FLAGS: FINE GRAINED 2025-12-04T09:29:22.5713361Z Size: 268419072(0xfffc000) KB 2025-12-04T09:29:22.5713672Z Allocatable: TRUE 2025-12-04T09:29:22.5713994Z Alloc Granule: 4KB 2025-12-04T09:29:22.5714332Z Alloc Recommended Granule:2048KB 2025-12-04T09:29:22.5714667Z Alloc Alignment: 4KB 2025-12-04T09:29:22.5714998Z Accessible by all: FALSE 2025-12-04T09:29:22.5715288Z Pool 4 2025-12-04T09:29:22.5715548Z Segment: GROUP 2025-12-04T09:29:22.5715847Z Size: 64(0x40) KB 2025-12-04T09:29:22.5716225Z Allocatable: FALSE 2025-12-04T09:29:22.5716549Z Alloc Granule: 0KB 2025-12-04T09:29:22.5716889Z Alloc Recommended Granule:0KB 2025-12-04T09:29:22.5717296Z Alloc Alignment: 0KB 2025-12-04T09:29:22.5717626Z Accessible by all: FALSE 2025-12-04T09:29:22.5717916Z ISA Info: 2025-12-04T09:29:22.5718133Z ISA 1 2025-12-04T09:29:22.5718408Z Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack- 2025-12-04T09:29:22.5718751Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T09:29:22.5719088Z Profiles: HSA_PROFILE_BASE 2025-12-04T09:29:22.5719422Z Default Rounding Mode: NEAR 2025-12-04T09:29:22.5719767Z Default Rounding Mode: NEAR 2025-12-04T09:29:22.5720099Z Fast f16: TRUE 2025-12-04T09:29:22.5720423Z Workgroup Max Size: 1024(0x400) 2025-12-04T09:29:22.5720728Z Workgroup Max Size per Dimension: 2025-12-04T09:29:22.5721015Z x 1024(0x400) 2025-12-04T09:29:22.5721300Z y 1024(0x400) 2025-12-04T09:29:22.5721573Z z 1024(0x400) 2025-12-04T09:29:22.5721873Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T09:29:22.5722167Z Grid Max Size per Dimension: 2025-12-04T09:29:22.5722422Z x 2147483647(0x7fffffff) 2025-12-04T09:29:22.5722699Z y 65535(0xffff) 2025-12-04T09:29:22.5722973Z z 65535(0xffff) 2025-12-04T09:29:22.5723281Z FBarrier Max Size: 32 2025-12-04T09:29:22.5723564Z ISA 2 2025-12-04T09:29:22.5723860Z Name: amdgcn-amd-amdhsa--gfx9-4-generic:sramecc+:xnack- 
2025-12-04T09:29:22.5724232Z Machine Models: HSA_MACHINE_MODEL_LARGE 2025-12-04T09:29:22.5724567Z Profiles: HSA_PROFILE_BASE 2025-12-04T09:29:22.5724901Z Default Rounding Mode: NEAR 2025-12-04T09:29:22.5725244Z Default Rounding Mode: NEAR 2025-12-04T09:29:22.5725564Z Fast f16: TRUE 2025-12-04T09:29:22.5725882Z Workgroup Max Size: 1024(0x400) 2025-12-04T09:29:22.5726219Z Workgroup Max Size per Dimension: 2025-12-04T09:29:22.5726483Z x 1024(0x400) 2025-12-04T09:29:22.5726837Z y 1024(0x400) 2025-12-04T09:29:22.5727109Z z 1024(0x400) 2025-12-04T09:29:22.5727409Z Grid Max Size: 4294967295(0xffffffff) 2025-12-04T09:29:22.5727705Z Grid Max Size per Dimension: 2025-12-04T09:29:22.5727960Z x 2147483647(0x7fffffff) 2025-12-04T09:29:22.5728232Z y 65535(0xffff) 2025-12-04T09:29:22.5728507Z z 65535(0xffff) 2025-12-04T09:29:22.5728807Z FBarrier Max Size: 32 2025-12-04T09:29:22.5729091Z *** Done *** 2025-12-04T09:29:22.5729309Z + rocminfo 2025-12-04T09:29:22.5729511Z + grep -E 'Name:.*\sgfx|Marketing' 2025-12-04T09:29:22.6262282Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T09:29:22.6262912Z Marketing Name: AMD EPYC 9575F 64-Core Processor 2025-12-04T09:29:22.6263435Z Name: gfx942 2025-12-04T09:29:22.6263928Z Marketing Name: 2025-12-04T09:29:22.6313201Z + MAYBE_ROCM=rocm/ 2025-12-04T09:29:22.6313631Z + [[ linux-jammy-rocm-py3.10 == *xpu* ]] 2025-12-04T09:29:22.6314117Z + [[ linux-jammy-rocm-py3.10 != *-bazel-* ]] 2025-12-04T09:29:22.6314558Z + pip_install ninja==1.10.2 2025-12-04T09:29:22.6315049Z + pip_install_pkg='python3 -m pip install --progress-bar off' 2025-12-04T09:29:22.6315637Z + python3 -m pip install --progress-bar off ninja==1.10.2 2025-12-04T09:29:22.8262356Z Collecting ninja==1.10.2 2025-12-04T09:29:22.8665226Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB) 2025-12-04T09:29:22.8741240Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 
2025-12-04T09:29:23.0367707Z Installing collected packages: ninja 2025-12-04T09:29:23.0406314Z Attempting uninstall: ninja 2025-12-04T09:29:23.0415039Z Found existing installation: ninja 1.11.1.4 2025-12-04T09:29:23.0415476Z Uninstalling ninja-1.11.1.4: 2025-12-04T09:29:23.0415645Z Successfully uninstalled ninja-1.11.1.4 2025-12-04T09:29:23.0510633Z Successfully installed ninja-1.10.2 2025-12-04T09:29:23.0828784Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:29:23.0829538Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/rocm/llvm/bin:/opt/rocm/opencl/bin:/opt/rocm/hip/bin:/opt/rocm/hcc/bin:/opt/rocm/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:29:23.0829985Z + [[ linux-jammy-rocm-py3.10 == *aarch64* ]] 2025-12-04T09:29:23.0830129Z + [[ linux-jammy-rocm-py3.10 == *asan* ]] 2025-12-04T09:29:23.0830291Z + [[ linux-jammy-rocm-py3.10 == *-debug* ]] 2025-12-04T09:29:23.0830430Z + [[ linux-jammy-rocm-py3.10 != *-bazel-* ]] 2025-12-04T09:29:23.0830626Z + echo 'We are not in debug mode: linux-jammy-rocm-py3.10. Expect the assertion to pass' 2025-12-04T09:29:23.0830871Z We are not in debug mode: linux-jammy-rocm-py3.10. 
Expect the assertion to pass 2025-12-04T09:29:23.0847191Z + cd test 2025-12-04T09:29:23.0866516Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2025-12-04T09:29:24.2611404Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2025-12-04T09:29:24.2612209Z + [[ default == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2025-12-04T09:29:24.2617773Z + [[ default == \l\e\g\a\c\y\_\n\v\i\d\i\a\_\d\r\i\v\e\r ]] 2025-12-04T09:29:24.2618518Z + DYNAMO_BENCHMARK_FLAGS=() 2025-12-04T09:29:24.2618722Z + [[ default == *pr_time_benchmarks* ]] 2025-12-04T09:29:24.2618867Z + [[ default == *dynamo_eager* ]] 2025-12-04T09:29:24.2618992Z + [[ default == *aot_eager* ]] 2025-12-04T09:29:24.2619441Z + [[ default == *aot_inductor* ]] 2025-12-04T09:29:24.2619570Z + [[ default == *max_autotune_inductor* ]] 2025-12-04T09:29:24.2619701Z + [[ default == *inductor* ]] 2025-12-04T09:29:24.2619818Z + [[ default == *dynamic* ]] 2025-12-04T09:29:24.2619941Z + [[ default == *cpu* ]] 2025-12-04T09:29:24.2620047Z + [[ default == *xpu* ]] 2025-12-04T09:29:24.2620177Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2025-12-04T09:29:24.2627587Z + [[ linux-jammy-rocm-py3.10 == *libtorch* ]] 2025-12-04T09:29:24.2627919Z + [[ linux-jammy-rocm-py3.10 == *-bazel-* ]] 2025-12-04T09:29:24.2628059Z + cd test 2025-12-04T09:29:24.2628196Z + python -c 'import torch; print(torch.__config__.show())' 2025-12-04T09:29:25.1281220Z PyTorch built with: 2025-12-04T09:29:25.1281415Z - GCC 11.4 2025-12-04T09:29:25.1281521Z - C++ Version: 201703 2025-12-04T09:29:25.1281939Z - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:29:25.1282508Z - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:29:25.1283561Z - OpenMP 201511 (a.k.a. 
OpenMP 4.5) 2025-12-04T09:29:25.1283805Z - LAPACK is enabled (usually provided by MKL) 2025-12-04T09:29:25.1283945Z - NNPACK is enabled 2025-12-04T09:29:25.1284444Z - CPU capability usage: AVX512 2025-12-04T09:29:25.1284567Z - HIP Runtime 7.1.25424 2025-12-04T09:29:25.1284673Z - MIOpen 3.5.1 2025-12-04T09:29:25.1284768Z - Magma 2.9.0 2025-12-04T09:29:25.1286736Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=35b7a9a26c5923d98aebaa41a031dae21788a9ee, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_FBGEMM_GENAI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.10.0, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=ON, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 2025-12-04T09:29:25.1288513Z 2025-12-04T09:29:25.3462411Z + cd test 2025-12-04T09:29:25.3463071Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2025-12-04T09:29:26.2109444Z ATen/Parallel: 2025-12-04T09:29:26.2109666Z at::get_num_threads() : 128 2025-12-04T09:29:26.2109814Z at::get_num_interop_threads() : 128 2025-12-04T09:29:26.2109951Z OpenMP 201511 (a.k.a. 
OpenMP 4.5) 2025-12-04T09:29:26.2110118Z omp_get_max_threads() : 128 2025-12-04T09:29:26.2110341Z Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:29:26.2110565Z mkl_get_max_threads() : 128 2025-12-04T09:29:26.2110737Z Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:29:26.2110914Z std::thread::hardware_concurrency() : 128 2025-12-04T09:29:26.2111043Z Environment variables: 2025-12-04T09:29:26.2111152Z OMP_NUM_THREADS : [not set] 2025-12-04T09:29:26.2111261Z MKL_NUM_THREADS : [not set] 2025-12-04T09:29:26.2111376Z ATen parallel backend: OpenMP 2025-12-04T09:29:26.2111451Z 2025-12-04T09:29:26.4537402Z + [[ default == *numpy_2* ]] 2025-12-04T09:29:26.4537573Z + [[ linux-jammy-rocm-py3.10 == *aarch64* ]] 2025-12-04T09:29:26.4537711Z + [[ default == *backward* ]] 2025-12-04T09:29:26.4537848Z + [[ default == *libtorch_agnostic_targetting* ]] 2025-12-04T09:29:26.4537981Z + [[ default == *xla* ]] 2025-12-04T09:29:26.4538435Z + [[ default == *vllm* ]] 2025-12-04T09:29:26.4538546Z + [[ default == *executorch* ]] 2025-12-04T09:29:26.4538793Z + [[ default == \j\i\t\_\l\e\g\a\c\y ]] 2025-12-04T09:29:26.4539026Z + [[ default == \q\u\a\n\t\i\z\a\t\i\o\n ]] 2025-12-04T09:29:26.4539203Z + [[ linux-jammy-rocm-py3.10 == *libtorch* ]] 2025-12-04T09:29:26.4539335Z + [[ default == distributed ]] 2025-12-04T09:29:26.4539455Z + [[ default == *operator_benchmark* ]] 2025-12-04T09:29:26.4539591Z + [[ default == *operator_microbenchmark* ]] 2025-12-04T09:29:26.4539782Z + [[ default == *attention_microbenchmark* ]] 2025-12-04T09:29:26.4539969Z + [[ default == *inductor_distributed* ]] 2025-12-04T09:29:26.4540157Z + [[ default == *inductor-halide* ]] 2025-12-04T09:29:26.4540340Z + [[ default == *inductor-pallas* ]] 2025-12-04T09:29:26.4540528Z + [[ default == *inductor-triton-cpu* ]] 2025-12-04T09:29:26.4540724Z + [[ default == *inductor-micro-benchmark* ]] 
2025-12-04T09:29:26.4540922Z + [[ default == *aoti_cross_compile_for_windows* ]] 2025-12-04T09:29:26.4541120Z + [[ default == *huggingface* ]] 2025-12-04T09:29:26.4541294Z + [[ default == *timm* ]] 2025-12-04T09:29:26.4541427Z + [[ default == cachebench ]] 2025-12-04T09:29:26.4541625Z + [[ default == verify_cachebench ]] 2025-12-04T09:29:26.4542126Z + [[ default == *torchbench* ]] 2025-12-04T09:29:26.4542311Z + [[ default == *inductor_cpp_wrapper* ]] 2025-12-04T09:29:26.4542506Z + [[ default == *inductor_core* ]] 2025-12-04T09:29:26.4542704Z + [[ default == *inductor* ]] 2025-12-04T09:29:26.4543674Z + [[ default == *einops* ]] 2025-12-04T09:29:26.4543887Z + [[ default == *dynamo_core* ]] 2025-12-04T09:29:26.4544022Z + [[ default == *dynamo_wrapped* ]] 2025-12-04T09:29:26.4544163Z + [[ linux-jammy-rocm-py3.10 == *rocm* ]] 2025-12-04T09:29:26.4544297Z + [[ -n '' ]] 2025-12-04T09:29:26.4544393Z + [[ 6 == 1 ]] 2025-12-04T09:29:26.4544485Z + [[ 6 == 2 ]] 2025-12-04T09:29:26.4544571Z + [[ 6 -gt 2 ]] 2025-12-04T09:29:26.4544670Z + install_torchvision 2025-12-04T09:29:26.4544777Z + local orig_preload 2025-12-04T09:29:26.4544906Z + local commit 2025-12-04T09:29:26.4545003Z ++ get_pinned_commit vision 2025-12-04T09:29:26.4545124Z ++ cat .github/ci_commit_pins/vision.txt 2025-12-04T09:29:26.4556427Z + commit=617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:29:26.4595126Z + orig_preload= 2025-12-04T09:29:26.4595402Z + '[' -n '' ']' 2025-12-04T09:29:26.4595748Z + [[ linux-jammy-rocm-py3.10 == *cuda* ]] 2025-12-04T09:29:26.4602728Z + pip_build_and_install git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e dist/vision 2025-12-04T09:29:26.4603093Z + local build_target=git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:29:26.4603323Z + local wheel_dir=dist/vision 2025-12-04T09:29:26.4603458Z + local found_whl=0 2025-12-04T09:29:26.4603579Z + for file in "${wheel_dir}"/*.whl 
2025-12-04T09:29:26.4603714Z + [[ -f dist/vision/*.whl ]] 2025-12-04T09:29:26.4603837Z + '[' 0 == 0 ']' 2025-12-04T09:29:26.4604114Z + python3 -m pip wheel --no-build-isolation --no-deps -w dist/vision git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:29:26.6132103Z Collecting git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:29:26.6135578Z Cloning https://github.com/pytorch/vision.git (to revision 617079d944b0e72632311c30ae2bbdf1168b901e) to /tmp/pip-req-build-bc2rjq7_ 2025-12-04T09:29:26.6153933Z Running command git clone --filter=blob:none --quiet https://github.com/pytorch/vision.git /tmp/pip-req-build-bc2rjq7_ 2025-12-04T09:29:30.1569040Z Running command git rev-parse -q --verify 'sha^617079d944b0e72632311c30ae2bbdf1168b901e' 2025-12-04T09:29:30.1577139Z Running command git fetch -q https://github.com/pytorch/vision.git 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:29:31.4562269Z Resolved https://github.com/pytorch/vision.git to commit 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:29:33.2593545Z Preparing metadata (pyproject.toml) ... done 2025-12-04T09:29:33.2615650Z Building wheels for collected packages: torchvision 2025-12-04T09:30:14.6985592Z Building wheel for torchvision (pyproject.toml) ... 
done 2025-12-04T09:30:14.7006420Z Created wheel for torchvision: filename=torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl size=1809038 sha256=bdb203bc9127564fb3315aa3baa574131e79c659235b2928f852b0fd251879eb 2025-12-04T09:30:14.7006909Z Stored in directory: /var/lib/jenkins/.cache/pip/wheels/12/b2/29/1f82685c5b5173629e1f36a9b93989ce92ce563e5fb91d27ac 2025-12-04T09:30:14.7037652Z Successfully built torchvision 2025-12-04T09:30:14.7689854Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:30:14.7696388Z + pip_install_whl dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:30:14.7697093Z + args=('dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl') 2025-12-04T09:30:14.7697328Z + local args 2025-12-04T09:30:14.7697504Z + [[ dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl == *\ * ]] 2025-12-04T09:30:14.7697714Z + for path in "${args[@]}" 2025-12-04T09:30:14.7698360Z + echo 'Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl' 2025-12-04T09:30:14.7698621Z Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:30:14.7698915Z + python3 -mpip install --no-index --no-deps dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:30:14.9189117Z Processing ./dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:30:14.9228514Z Installing collected packages: torchvision 2025-12-04T09:30:15.1318394Z Successfully installed torchvision-0.25.0a0+617079d 2025-12-04T09:30:15.1532520Z + '[' -n '' ']' 2025-12-04T09:30:15.1533073Z + test_python_shard 6 2025-12-04T09:30:15.1533196Z + [[ -z 6 ]] 2025-12-04T09:30:15.1533539Z + python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard 6 6 --verbose --upload-artifacts-while-running 2025-12-04T09:30:17.0984043Z Excluding 
inductor/test_max_autotune on ROCm 2025-12-04T09:30:17.0984340Z Excluding test_cuda_nvml_based_avail on ROCm 2025-12-04T09:30:18.1125034Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/pytorch/test/.pytorch-disabled-tests.json 2025-12-04T09:30:18.4564580Z Ignoring disabled issues: [''] 2025-12-04T09:30:18.4612405Z Found test times from artifacts 2025-12-04T09:30:18.4786002Z Found test times from artifacts 2025-12-04T09:30:18.4790975Z Running all tests 2025-12-04T09:30:18.5079128Z Running parallel tests on 1 processes 2025-12-04T09:30:18.5082746Z Name: tests to run (est. time: 185.56min) 2025-12-04T09:30:18.5082943Z Serial tests (118): 2025-12-04T09:30:18.5083085Z inductor/test_torchinductor 2/2 2025-12-04T09:30:18.5083307Z inductor/test_torchinductor_dynamic_shapes 3/4 2025-12-04T09:30:18.5083826Z inductor/test_kernel_benchmark 1/1 2025-12-04T09:30:18.5084205Z inductor/test_torchinductor_opinfo 1/12 2025-12-04T09:30:18.5134925Z inductor/test_torchinductor_opinfo 7/12 2025-12-04T09:30:18.5136168Z inductor/test_pattern_matcher 1/1 2025-12-04T09:30:18.5136395Z inductor/test_cpu_repro 1/5 2025-12-04T09:30:18.5136535Z inductor/test_compiled_autograd 1/2 2025-12-04T09:30:18.5136667Z dynamo/test_unspec 1/1 2025-12-04T09:30:18.5136786Z dynamo/test_higher_order_ops 1/1 2025-12-04T09:30:18.5136919Z inductor/test_flex_attention 1/4 2025-12-04T09:30:18.5137045Z inductor/test_halide 1/1 2025-12-04T09:30:18.5137165Z inductor/test_compile_subprocess 2/3 2025-12-04T09:30:18.5137298Z inductor/test_deterministic 4/4 2025-12-04T09:30:18.5137434Z export/test_functionalized_assertions 1/1 2025-12-04T09:30:18.5137572Z inductor/test_loop_ordering 1/1 2025-12-04T09:30:18.5137698Z export/test_serialize 1/1 2025-12-04T09:30:18.5138209Z inductor/test_cutedsl_template 1/1 2025-12-04T09:30:18.5138338Z inductor/test_benchmark_fusion 1/1 2025-12-04T09:30:18.5138461Z export/test_serdes 1/1 2025-12-04T09:30:18.5138574Z 
inductor/test_combo_kernels 1/1 2025-12-04T09:30:18.5138709Z inductor/test_control_deps 1/1 2025-12-04T09:30:18.5138838Z inductor/test_compiled_optimizers 2/2 2025-12-04T09:30:18.5138967Z dynamo/test_unittest 1/1 2025-12-04T09:30:18.5139079Z dynamo/test_streams 1/1 2025-12-04T09:30:18.5139193Z inductor/test_unbacked_symints 1/1 2025-12-04T09:30:18.5139321Z inductor/test_mix_order_reduction 1/1 2025-12-04T09:30:18.5139449Z dynamo/test_cudagraphs 1/1 2025-12-04T09:30:18.5139562Z inductor/test_alignment 1/1 2025-12-04T09:30:18.5139676Z inductor/test_padding 1/1 2025-12-04T09:30:18.5139788Z dynamo/test_profiler 1/1 2025-12-04T09:30:18.5139904Z dynamo/test_guard_serialization 1/1 2025-12-04T09:30:18.5140026Z dynamo/test_compile 1/1 2025-12-04T09:30:18.5140143Z dynamo/test_nested_graph_breaks 1/1 2025-12-04T09:30:18.5140265Z dynamo/test_dicts 1/1 2025-12-04T09:30:18.5140379Z inductor/test_needs_exact_strides 1/1 2025-12-04T09:30:18.5140508Z inductor/test_auto_functionalize 1/1 2025-12-04T09:30:18.5140782Z inductor/test_split_cat_fx_aten_passes 1/1 2025-12-04T09:30:18.5140914Z inductor/test_minifier_isolate 1/1 2025-12-04T09:30:18.5141035Z dynamo/test_aot_compile 1/1 2025-12-04T09:30:18.5141147Z dynamo/test_list 1/1 2025-12-04T09:30:18.5141253Z dynamo/test_resume 1/1 2025-12-04T09:30:18.5141371Z inductor/test_augmented_graph_helper 1/1 2025-12-04T09:30:18.5141497Z dynamo/test_deviceguard 1/1 2025-12-04T09:30:18.5141613Z dynamo/test_sources 1/1 2025-12-04T09:30:18.5141735Z dynamo/test_backward_higher_order_ops 1/1 2025-12-04T09:30:18.5141867Z dynamo/test_modes 1/1 2025-12-04T09:30:18.5141978Z dynamo/test_optimizers 1/1 2025-12-04T09:30:18.5142092Z export/test_torchbind 1/1 2025-12-04T09:30:18.5142222Z inductor/test_custom_partitioner_fn 1/1 2025-12-04T09:30:18.5142351Z dynamo/test_debug_utils 1/1 2025-12-04T09:30:18.5142465Z dynamo/test_base_hop 1/1 2025-12-04T09:30:18.5142576Z dynamo/test_export 1/1 2025-12-04T09:30:18.5142690Z dynamo/test_sets 1/1 
2025-12-04T09:30:18.5142797Z dynamo/test_package 1/1 2025-12-04T09:30:18.5142914Z inductor/test_efficient_conv_bn_eval 1/1 2025-12-04T09:30:18.5143040Z inductor/test_torchbind 1/1 2025-12-04T09:30:18.5143157Z dynamo/test_python_dispatcher 1/1 2025-12-04T09:30:18.5143279Z export/test_swap 1/1 2025-12-04T09:30:18.5143386Z export/test_unflatten 1/1 2025-12-04T09:30:18.5143503Z dynamo/test_verify_correctness 1/1 2025-12-04T09:30:18.5143639Z dynamo/test_wrap_inductor_compiled_regions 1/1 2025-12-04T09:30:18.5143780Z inductor/test_fxir_backend 1/1 2025-12-04T09:30:18.5143900Z dynamo/test_bytecode_utils 1/1 2025-12-04T09:30:18.5144040Z dynamo/test_tree_map 1/1 2025-12-04T09:30:18.5144153Z dynamo/test_minifier 1/1 2025-12-04T09:30:18.5144270Z dynamo/test_guard_manager 1/1 2025-12-04T09:30:18.5144386Z export/test_schema 1/1 2025-12-04T09:30:18.5144495Z dynamo/test_torchrec 1/1 2025-12-04T09:30:18.5144607Z export/test_pass_infra 1/1 2025-12-04T09:30:18.5144728Z dynamo/test_recompile_ux 1/1 2025-12-04T09:30:18.5144869Z inductor/test_cudagraph_trees_expandable_segments 1/1 2025-12-04T09:30:18.5145040Z test_autoload 1/1 2025-12-04T09:30:18.5145142Z test_foreach 1/1 2025-12-04T09:30:18.5145247Z functorch/test_minifier 1/1 2025-12-04T09:30:18.5145370Z higher_order_ops/test_invoke_quant 1/1 2025-12-04T09:30:18.5145496Z torch_np/test_basic 1/1 2025-12-04T09:30:18.5145613Z higher_order_ops/test_with_effects 1/1 2025-12-04T09:30:18.5145734Z test_decomp 1/12 2025-12-04T09:30:18.5145833Z test_decomp 7/12 2025-12-04T09:30:18.5145964Z test_complex 1/1 2025-12-04T09:30:18.5146064Z test_optim 1/1 2025-12-04T09:30:18.5146160Z test_fx 1/2 2025-12-04T09:30:18.5146319Z test_functionalization_of_rng_ops 1/1 2025-12-04T09:30:18.5146447Z test_fx_reinplace_pass 1/1 2025-12-04T09:30:18.5146563Z functorch/test_control_flow 1/3 2025-12-04T09:30:18.5146687Z test_cuda_expandable_segments 1/1 2025-12-04T09:30:18.5146807Z test_autocast 1/1 2025-12-04T09:30:18.5146907Z test_logging 1/1 
2025-12-04T09:30:18.5147010Z test_python_dispatch 1/1 2025-12-04T09:30:18.5147123Z nn/test_lazy_modules 1/1 2025-12-04T09:30:18.5147235Z nn/test_pruning 1/1 2025-12-04T09:30:18.5147337Z test_monitor 1/1 2025-12-04T09:30:18.5147438Z test_cuda_sanitizer 1/1 2025-12-04T09:30:18.5147546Z test_bundled_inputs 1/1 2025-12-04T09:30:18.5147665Z torch_np/numpy_tests/core/test_numeric 1/1 2025-12-04T09:30:18.5147805Z torch_np/numpy_tests/core/test_multiarray 1/1 2025-12-04T09:30:18.5147931Z test_itt 1/1 2025-12-04T09:30:18.5148040Z torch_np/numpy_tests/lib/test_function_base 1/1 2025-12-04T09:30:18.5148172Z test_masked 1/1 2025-12-04T09:30:18.5148271Z test_sympy_utils 1/1 2025-12-04T09:30:18.5148375Z test_jit_disabled 1/1 2025-12-04T09:30:18.5148496Z test_subclass 1/1 2025-12-04T09:30:18.5148599Z test_import_stats 1/1 2025-12-04T09:30:18.5148713Z functorch/test_vmap_registrations 1/1 2025-12-04T09:30:18.5148911Z nn/test_parametrization 1/1 2025-12-04T09:30:18.5149033Z complex_tensor/test_complex_tensor 1/1 2025-12-04T09:30:18.5149167Z benchmark_utils/test_benchmark_utils 1/1 2025-12-04T09:30:18.5149293Z functorch/test_dims 1/1 2025-12-04T09:30:18.5149415Z torch_np/numpy_tests/core/test_scalarmath 1/1 2025-12-04T09:30:18.5149546Z test_scaled_matmul_cuda 1/1 2025-12-04T09:30:18.5149672Z torch_np/numpy_tests/core/test_shape_base 1/1 2025-12-04T09:30:18.5149796Z test_vulkan 1/1 2025-12-04T09:30:18.5149893Z lazy/test_generator 1/1 2025-12-04T09:30:18.5150000Z nn/test_convolution 1/1 2025-12-04T09:30:18.5150108Z functorch/test_ops 3/4 2025-12-04T09:30:18.5150215Z nn/test_embedding 1/1 2025-12-04T09:30:18.5150322Z test_unary_ufuncs 1/1 2025-12-04T09:30:18.5150428Z Parallel tests (0): 2025-12-04T09:30:18.5150540Z Name: excluded (est. time: 0.0min) 2025-12-04T09:30:18.5150655Z Serial tests (0): 2025-12-04T09:30:18.5150754Z Parallel tests (0): 2025-12-04T09:30:18.5150927Z Running inductor/test_torchinductor 2/2 ... 
[2025-12-04 09:30:18.510158][5634639.016560447] 2025-12-04T09:30:18.5151124Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:30:18.5151593Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '--shard-id=2', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:30:18.510521] 2025-12-04T09:31:25.6822623Z 2025-12-04T09:31:25.6823207Z PRINTING LOG FILE of inductor/test_torchinductor 2/2 (test/test-reports/inductor.test_torchinductor_2.2_916af9a5c16d1706_.log) 2025-12-04T09:31:25.6823696Z Test results will be stored in test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:31:25.6823984Z ============================= test session starts ============================== 2025-12-04T09:31:25.6824228Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:31:25.6824434Z cachedir: .pytest_cache 2025-12-04T09:31:25.6824668Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:31:25.6824916Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T09:31:25.6825042Z configfile: pytest.ini 2025-12-04T09:31:25.6825282Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:31:25.6825538Z collecting ... 
collected 999 items 2025-12-04T09:31:25.6825694Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:31:25.6853224Z Running 250 items in this shard: test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, 
test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.6881016Z 2025-12-04T09:31:25.6881149Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [1.9205s] [ 0%] 2025-12-04T09:31:25.6881433Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.2431s] [ 0%] 2025-12-04T09:31:25.6881716Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1274s] [ 1%] 2025-12-04T09:31:25.6881998Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0863s] [ 1%] 2025-12-04T09:31:25.6882284Z 
inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5829s] [ 2%] 2025-12-04T09:31:25.6882575Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4524s] [ 2%] 2025-12-04T09:31:25.6882854Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4730s] [ 2%] 2025-12-04T09:31:25.6883188Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4390s] [ 2%] 2025-12-04T09:31:25.6883460Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4056s] [ 2%] 2025-12-04T09:31:25.6883735Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4011s] [ 2%] 2025-12-04T09:31:25.6884006Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4176s] [ 2%] 2025-12-04T09:31:25.6884277Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4682s] [ 2%] 2025-12-04T09:31:25.6884550Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.3935s] [ 2%] 2025-12-04T09:31:25.6884825Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4179s] [ 2%] 2025-12-04T09:31:25.6885095Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.7098s] [ 2%] 2025-12-04T09:31:25.6885374Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4034s] [ 2%] 2025-12-04T09:31:25.6885646Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4078s] [ 2%] 2025-12-04T09:31:25.6885918Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4175s] [ 2%] 2025-12-04T09:31:25.6886245Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.3914s] [ 2%] 2025-12-04T09:31:25.6886515Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.4112s] [ 2%] 
2025-12-04T09:31:25.6886785Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2242s] [ 2%] 2025-12-04T09:31:25.6887058Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2252s] [ 2%] 2025-12-04T09:31:25.6889336Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2527s] [ 2%] 2025-12-04T09:31:25.6889614Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2283s] [ 2%] 2025-12-04T09:31:25.6889883Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2437s] [ 2%] 2025-12-04T09:31:25.6890155Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2363s] [ 2%] 2025-12-04T09:31:25.6890426Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2421s] [ 2%] 2025-12-04T09:31:25.6890698Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2327s] [ 2%] 2025-12-04T09:31:25.6890967Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2454s] [ 2%] 2025-12-04T09:31:25.6891321Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2289s] [ 2%] 2025-12-04T09:31:25.6891595Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2423s] [ 2%] 2025-12-04T09:31:25.6891870Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2468s] [ 2%] 2025-12-04T09:31:25.6892141Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2378s] [ 2%] 2025-12-04T09:31:25.6892414Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2400s] [ 2%] 2025-12-04T09:31:25.6892686Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2446s] [ 2%] 2025-12-04T09:31:25.6892958Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED 
[0.2320s] [ 2%] 2025-12-04T09:31:25.6893226Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2278s] [ 2%] 2025-12-04T09:31:25.6893501Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2381s] [ 2%] 2025-12-04T09:31:25.6893770Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2339s] [ 2%] 2025-12-04T09:31:25.6894087Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2409s] [ 2%] 2025-12-04T09:31:25.6894355Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2379s] [ 2%] 2025-12-04T09:31:25.6894624Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2289s] [ 2%] 2025-12-04T09:31:25.6894893Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.5333s] [ 2%] 2025-12-04T09:31:25.6895161Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2364s] [ 2%] 2025-12-04T09:31:25.6895430Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2233s] [ 2%] 2025-12-04T09:31:25.6895707Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2156s] [ 2%] 2025-12-04T09:31:25.6896038Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2341s] [ 2%] 2025-12-04T09:31:25.6896311Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2178s] [ 2%] 2025-12-04T09:31:25.6896588Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2419s] [ 2%] 2025-12-04T09:31:25.6896860Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2274s] [ 2%] 2025-12-04T09:31:25.6897130Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2369s] [ 2%] 2025-12-04T09:31:25.6897400Z 
inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2288s] [ 2%] 2025-12-04T09:31:25.6897671Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2375s] [ 2%] 2025-12-04T09:31:25.6897947Z inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda PASSED [0.2589s] [ 2%] 2025-12-04T09:31:25.6898222Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1408s] [ 2%] 2025-12-04T09:31:25.6898507Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1606s] [ 2%] 2025-12-04T09:31:25.6898791Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1483s] [ 2%] 2025-12-04T09:31:25.6899071Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1545s] [ 2%] 2025-12-04T09:31:25.6899349Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1382s] [ 2%] 2025-12-04T09:31:25.6899627Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1553s] [ 2%] 2025-12-04T09:31:25.6899907Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1433s] [ 2%] 2025-12-04T09:31:25.6900234Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1541s] [ 2%] 2025-12-04T09:31:25.6900509Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1422s] [ 2%] 2025-12-04T09:31:25.6900788Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1492s] [ 2%] 2025-12-04T09:31:25.6901063Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1467s] [ 2%] 2025-12-04T09:31:25.6901343Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1610s] [ 2%] 2025-12-04T09:31:25.6901619Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1447s] [ 2%] 2025-12-04T09:31:25.6901895Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1520s] [ 2%] 2025-12-04T09:31:25.6902174Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1418s] [ 2%] 2025-12-04T09:31:25.6902453Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.4663s] [ 2%] 2025-12-04T09:31:25.6902728Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1341s] [ 2%] 2025-12-04T09:31:25.6903041Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1568s] [ 2%] 2025-12-04T09:31:25.6903319Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1467s] [ 2%] 2025-12-04T09:31:25.6903597Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1568s] [ 2%] 2025-12-04T09:31:25.6903880Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1349s] [ 2%] 2025-12-04T09:31:25.6904159Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1531s] [ 2%] 2025-12-04T09:31:25.6904435Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1391s] [ 2%] 2025-12-04T09:31:25.6904713Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1610s] [ 2%] 2025-12-04T09:31:25.6904990Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1413s] [ 2%] 2025-12-04T09:31:25.6905272Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1545s] [ 2%] 2025-12-04T09:31:25.6905549Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1353s] [ 2%] 2025-12-04T09:31:25.6905825Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1522s] [ 2%] 2025-12-04T09:31:25.6906133Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1387s] [ 2%] 2025-12-04T09:31:25.6906413Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1460s] [ 2%] 2025-12-04T09:31:25.6906696Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1438s] [ 2%] 2025-12-04T09:31:25.6906972Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1535s] [ 2%] 2025-12-04T09:31:25.6907251Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1675s] [ 2%] 2025-12-04T09:31:25.6907552Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1476s] [ 2%] 2025-12-04T09:31:25.6907868Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1511s] [ 2%] 2025-12-04T09:31:25.6908149Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1520s] [ 2%] 2025-12-04T09:31:25.6908429Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1481s] [ 2%] 2025-12-04T09:31:25.6908707Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1397s] [ 2%] 2025-12-04T09:31:25.6909033Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1375s] [ 2%] 2025-12-04T09:31:25.6909318Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1468s] [ 2%] 2025-12-04T09:31:25.6909594Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1410s] [ 2%] 2025-12-04T09:31:25.6909873Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1467s] [ 2%] 2025-12-04T09:31:25.6910151Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1448s] [ 2%] 2025-12-04T09:31:25.6910428Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.4891s] [ 2%] 2025-12-04T09:31:25.6910705Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1531s] [ 2%] 2025-12-04T09:31:25.6910982Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1441s] [ 2%] 2025-12-04T09:31:25.6911264Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1432s] [ 2%] 2025-12-04T09:31:25.6911547Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1440s] [ 2%] 2025-12-04T09:31:25.6911863Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_slice_scatter_cuda PASSED [0.1511s] [ 2%] 2025-12-04T09:31:25.6912146Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1366s] [ 2%] 2025-12-04T09:31:25.6912424Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1164s] [ 2%] 2025-12-04T09:31:25.6912702Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1109s] [ 2%] 2025-12-04T09:31:25.6912978Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1110s] [ 2%] 2025-12-04T09:31:25.6913252Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1115s] [ 2%] 2025-12-04T09:31:25.6913528Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1101s] [ 2%] 2025-12-04T09:31:25.6913806Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1120s] [ 2%] 2025-12-04T09:31:25.6914088Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1101s] [ 2%] 2025-12-04T09:31:25.6914367Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1111s] [ 2%] 2025-12-04T09:31:25.6914644Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1108s] [ 2%] 2025-12-04T09:31:25.6914922Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1105s] [ 2%] 2025-12-04T09:31:25.6915197Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1115s] [ 2%] 2025-12-04T09:31:25.6915473Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1113s] [ 2%] 2025-12-04T09:31:25.6915757Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1096s] [ 2%] 2025-12-04T09:31:25.6916086Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1112s] [ 2%] 2025-12-04T09:31:25.6916365Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1159s] [ 2%] 2025-12-04T09:31:25.6916643Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1086s] [ 2%] 2025-12-04T09:31:25.6916917Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1099s] [ 2%] 2025-12-04T09:31:25.6917195Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1109s] [ 2%] 2025-12-04T09:31:25.6917470Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1107s] [ 2%] 2025-12-04T09:31:25.6917743Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1107s] [ 2%] 2025-12-04T09:31:25.6918061Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1104s] [ 2%] 2025-12-04T09:31:25.6918342Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1097s] [ 2%] 2025-12-04T09:31:25.6918625Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1198s] [ 2%] 2025-12-04T09:31:25.6918900Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1083s] [ 2%] 2025-12-04T09:31:25.6919184Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1094s] [ 2%] 2025-12-04T09:31:25.6919466Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1113s] [ 2%] 2025-12-04T09:31:25.6919748Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1169s] [ 2%] 2025-12-04T09:31:25.6920031Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1186s] [ 2%] 2025-12-04T09:31:25.6920316Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1085s] [ 2%] 2025-12-04T09:31:25.6920591Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1072s] [ 2%] 2025-12-04T09:31:25.6920900Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1092s] [ 2%] 2025-12-04T09:31:25.6921176Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1179s] [ 2%] 2025-12-04T09:31:25.6921452Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1081s] [ 2%] 2025-12-04T09:31:25.6921730Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1199s] [ 2%] 2025-12-04T09:31:25.6922005Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1229s] [ 2%] 2025-12-04T09:31:25.6922287Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1146s] [ 2%] 2025-12-04T09:31:25.6922574Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1088s] [ 2%] 2025-12-04T09:31:25.6922854Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1088s] [ 2%] 2025-12-04T09:31:25.6923134Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1101s] [ 2%] 2025-12-04T09:31:25.6923411Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1097s] [ 2%] 2025-12-04T09:31:25.6923686Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1210s] [ 2%] 2025-12-04T09:31:25.6923963Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1133s] [ 2%] 2025-12-04T09:31:25.6924239Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1169s] [ 2%] 2025-12-04T09:31:25.6924519Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1079s] [ 2%] 2025-12-04T09:31:25.6924795Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1092s] [ 2%] 2025-12-04T09:31:25.6925069Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.5423s] [ 2%] 2025-12-04T09:31:25.6925349Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1106s] [ 2%] 2025-12-04T09:31:25.6925624Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_default_cuda PASSED [0.1184s] [ 2%] 2025-12-04T09:31:25.6925900Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0844s] [ 2%] 2025-12-04T09:31:25.6941099Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0815s] [ 2%] 2025-12-04T09:31:25.6941394Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0814s] [ 2%] 2025-12-04T09:31:25.6941783Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0809s] [ 2%] 2025-12-04T09:31:25.6942058Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0914s] [ 2%] 2025-12-04T09:31:25.6942332Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0810s] [ 2%] 2025-12-04T09:31:25.6942614Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0809s] [ 2%] 2025-12-04T09:31:25.6942885Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0826s] [ 2%] 2025-12-04T09:31:25.6943157Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0821s] [ 2%] 2025-12-04T09:31:25.6943429Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0825s] [ 2%] 2025-12-04T09:31:25.6943700Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0817s] [ 2%] 2025-12-04T09:31:25.6943975Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0843s] [ 2%] 2025-12-04T09:31:25.6944247Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0880s] [ 2%] 2025-12-04T09:31:25.6944520Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0898s] [ 2%] 2025-12-04T09:31:25.6944833Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0830s] [ 2%] 2025-12-04T09:31:25.6945105Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0807s] [ 2%] 2025-12-04T09:31:25.6945376Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0899s] [ 2%] 2025-12-04T09:31:25.6945651Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0813s] [ 2%] 2025-12-04T09:31:25.6945996Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0812s] [ 2%] 2025-12-04T09:31:25.6946276Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0910s] [ 
2%] 2025-12-04T09:31:25.6946557Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0816s] [ 2%] 2025-12-04T09:31:25.6946834Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0800s] [ 2%] 2025-12-04T09:31:25.6947113Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0804s] [ 2%] 2025-12-04T09:31:25.6947389Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0845s] [ 2%] 2025-12-04T09:31:25.6947665Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0870s] [ 2%] 2025-12-04T09:31:25.6947936Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.1670s] [ 2%] 2025-12-04T09:31:25.6948207Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0958s] [ 2%] 2025-12-04T09:31:25.6948481Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.1178s] [ 2%] 2025-12-04T09:31:25.6948753Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0935s] [ 2%] 2025-12-04T09:31:25.6949024Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0872s] [ 2%] 2025-12-04T09:31:25.6956448Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0839s] [ 2%] 2025-12-04T09:31:25.6956837Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0802s] [ 2%] 2025-12-04T09:31:25.6957119Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0797s] [ 2%] 2025-12-04T09:31:25.6957394Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0799s] [ 2%] 2025-12-04T09:31:25.6957666Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0797s] [ 2%] 2025-12-04T09:31:25.6958254Z 
inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0796s] [ 2%] 2025-12-04T09:31:25.6958521Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0816s] [ 2%] 2025-12-04T09:31:25.6958791Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0799s] [ 2%] 2025-12-04T09:31:25.6959071Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0793s] [ 2%] 2025-12-04T09:31:25.6959358Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0798s] [ 2%] 2025-12-04T09:31:25.6959625Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0789s] [ 2%] 2025-12-04T09:31:25.6959892Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0842s] [ 2%] 2025-12-04T09:31:25.6960159Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0840s] [ 2%] 2025-12-04T09:31:25.6960431Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0825s] [ 2%] 2025-12-04T09:31:25.6960699Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0800s] [ 2%] 2025-12-04T09:31:25.6960965Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0792s] [ 2%] 2025-12-04T09:31:25.6961377Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0788s] [ 2%] 2025-12-04T09:31:25.6961642Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0799s] [ 2%] 2025-12-04T09:31:25.6961908Z inductor/test_torchinductor.py::GPUTests::test_remove_noop_view_dtype_cuda PASSED [0.0795s] [ 2%] 2025-12-04T09:31:25.6962184Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.5271s] [ 2%] 2025-12-04T09:31:25.6962469Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED 
[0.5152s] [ 2%] 2025-12-04T09:31:25.6962754Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5071s] [ 2%] 2025-12-04T09:31:25.6963035Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.4947s] [ 2%] 2025-12-04T09:31:25.6963315Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2440s] [ 2%] 2025-12-04T09:31:25.6963602Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2472s] [ 2%] 2025-12-04T09:31:25.6963881Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2453s] [ 2%] 2025-12-04T09:31:25.6964163Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.9214s] [ 2%] 2025-12-04T09:31:25.6964443Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2768s] [ 2%] 2025-12-04T09:31:25.6964721Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2506s] [ 2%] 2025-12-04T09:31:25.6965002Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2686s] [ 2%] 2025-12-04T09:31:25.6965282Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2526s] [ 2%] 2025-12-04T09:31:25.6965565Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.4851s] [ 2%] 2025-12-04T09:31:25.6965844Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.4715s] [ 2%] 2025-12-04T09:31:25.6966264Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2467s] [ 2%] 2025-12-04T09:31:25.6966546Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.5022s] [ 2%] 2025-12-04T09:31:25.6966824Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED 
[0.5380s] [ 2%] 2025-12-04T09:31:25.6967156Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.4983s] [ 2%] 2025-12-04T09:31:25.6967435Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2504s] [ 2%] 2025-12-04T09:31:25.6967712Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.4776s] [ 2%] 2025-12-04T09:31:25.6967993Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2496s] [ 2%] 2025-12-04T09:31:25.6968272Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2445s] [ 2%] 2025-12-04T09:31:25.6968551Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5056s] [ 2%] 2025-12-04T09:31:25.6968834Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2761s] [ 2%] 2025-12-04T09:31:25.6969116Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2430s] [ 2%] 2025-12-04T09:31:25.6969397Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5208s] [ 2%] 2025-12-04T09:31:25.6969675Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5114s] [ 2%] 2025-12-04T09:31:25.6970008Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2585s] [ 2%] 2025-12-04T09:31:25.6970286Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.6904s] [ 2%] 2025-12-04T09:31:25.6970563Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2427s] [ 2%] 2025-12-04T09:31:25.6970841Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2631s] [ 2%] 2025-12-04T09:31:25.6971117Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED 
[0.5441s] [ 2%] 2025-12-04T09:31:25.6971393Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2497s] [ 2%] 2025-12-04T09:31:25.6971672Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5193s] [ 2%] 2025-12-04T09:31:25.6971950Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2388s] [ 2%] 2025-12-04T09:31:25.6972231Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5305s] [ 2%] 2025-12-04T09:31:25.6972512Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.5236s] [ 2%] 2025-12-04T09:31:25.6972795Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.4855s] [ 2%] 2025-12-04T09:31:25.6973073Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2603s] [ 2%] 2025-12-04T09:31:25.6973351Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2563s] [ 2%] 2025-12-04T09:31:25.6973630Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5035s] [ 2%] 2025-12-04T09:31:25.6973910Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.5223s] [ 2%] 2025-12-04T09:31:25.6974187Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.2502s] [ 2%] 2025-12-04T09:31:25.6974466Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.4919s] [ 2%] 2025-12-04T09:31:25.6974742Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED [0.5330s] [ 2%] 2025-12-04T09:31:25.6975020Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.5010s] [ 2%] 2025-12-04T09:31:25.6975297Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda FAILED 
[0.2144s] [ 2%] 2025-12-04T09:31:25.6975575Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.5163s] [ 2%] 2025-12-04T09:31:25.6975879Z inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda PASSED [0.9519s] [ 2%] 2025-12-04T09:31:25.6976089Z 2025-12-04T09:31:25.6976153Z =================================== FAILURES =================================== 2025-12-04T09:31:25.6976338Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.6976512Z Traceback (most recent call last): 2025-12-04T09:31:25.6976974Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.6977180Z self.common( 2025-12-04T09:31:25.6977338Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.6977572Z return func(*args, **kwds) 2025-12-04T09:31:25.6977779Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.6977979Z check_model( 2025-12-04T09:31:25.6978157Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.6978357Z assert_equal_fn( 2025-12-04T09:31:25.6978567Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.6978856Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.6979119Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.6979394Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.6979575Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.6979671Z
2025-12-04T09:31:25.6979724Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:31:25.6979912Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:31:25.6980156Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed)
2025-12-04T09:31:25.6980297Z
2025-12-04T09:31:25.6980354Z The failure occurred for item [2]
2025-12-04T09:31:25.6980433Z
2025-12-04T09:31:25.6980514Z To execute this test, run the following from the base repo dir:
2025-12-04T09:31:25.6980777Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda
2025-12-04T09:31:25.6980968Z
2025-12-04T09:31:25.6981066Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:31:25.6981280Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.6981460Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.6981664Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.6982164Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.6982623Z graph_break []
2025-12-04T09:31:25.6982769Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________
2025-12-04T09:31:25.6982942Z Traceback (most recent call last):
2025-12-04T09:31:25.6983158Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:31:25.6983364Z self.common(
2025-12-04T09:31:25.6983519Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:31:25.6983697Z return func(*args,
**kwds) 2025-12-04T09:31:25.6983897Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.6984104Z check_model( 2025-12-04T09:31:25.6984282Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.6984481Z assert_equal_fn( 2025-12-04T09:31:25.6984752Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.6984994Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.6985254Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.6985528Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.6985696Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.6985791Z 2025-12-04T09:31:25.6985839Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.6986997Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.6987226Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.6987360Z 2025-12-04T09:31:25.6987409Z The failure occurred for item [2] 2025-12-04T09:31:25.6987491Z 2025-12-04T09:31:25.6987565Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.6987829Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.6988019Z 2025-12-04T09:31:25.6988110Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.6988359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.6988536Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.6988737Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:31:25.6989231Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.6989658Z graph_break [] 2025-12-04T09:31:25.6989796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.6989972Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.6990172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.6990663Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.6991088Z graph_break [] 2025-12-04T09:31:25.6991222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.6991397Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.6991595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.6992086Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.6992516Z graph_break [] 2025-12-04T09:31:25.6992648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.6992824Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.6993021Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.6993504Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.6993926Z graph_break [] 2025-12-04T09:31:25.6994101Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.6994274Z Traceback (most recent call last): 2025-12-04T09:31:25.6994486Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.6994694Z self.common( 2025-12-04T09:31:25.6994846Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.6995022Z return func(*args, **kwds) 2025-12-04T09:31:25.6995225Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.6995429Z check_model( 2025-12-04T09:31:25.6995608Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.6995806Z assert_equal_fn( 2025-12-04T09:31:25.6996052Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.6996292Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.6996553Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.6996828Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.6997031Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.6997124Z 2025-12-04T09:31:25.6997172Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.6997358Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.6997602Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.6997747Z 2025-12-04T09:31:25.6997798Z The failure occurred for item [2] 2025-12-04T09:31:25.6997881Z 2025-12-04T09:31:25.6997955Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.6998212Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.6998400Z 2025-12-04T09:31:25.6998494Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.6998700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.6998878Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.6999080Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.6999571Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.6999996Z graph_break [] 2025-12-04T09:31:25.7000133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7000314Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7000512Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7000998Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7001426Z graph_break [] 2025-12-04T09:31:25.7001560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7001736Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7001931Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7002451Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7002872Z graph_break [] 2025-12-04T09:31:25.7003008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7003191Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7003396Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7018764Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7019306Z graph_break [] 2025-12-04T09:31:25.7019462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7019652Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7019871Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7020366Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7020901Z graph_break [] 2025-12-04T09:31:25.7021048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7021230Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7021432Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7021926Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7022356Z graph_break [] 2025-12-04T09:31:25.7022505Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7022686Z Traceback (most recent call last): 2025-12-04T09:31:25.7022907Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7023115Z self.common( 2025-12-04T09:31:25.7023267Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7023441Z return func(*args, **kwds) 2025-12-04T09:31:25.7023646Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7023848Z check_model( 2025-12-04T09:31:25.7024026Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7024221Z assert_equal_fn( 2025-12-04T09:31:25.7024431Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7024670Z return super().assertEqual(x, y, *args, **kwargs) 
2025-12-04T09:31:25.7024928Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7025207Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7025375Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7025469Z 2025-12-04T09:31:25.7025516Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7025702Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7025984Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7026128Z 2025-12-04T09:31:25.7026175Z The failure occurred for item [2] 2025-12-04T09:31:25.7026259Z 2025-12-04T09:31:25.7026337Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7026634Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7026824Z 2025-12-04T09:31:25.7026915Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7027126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7027301Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7027502Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7027994Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7028413Z graph_break [] 2025-12-04T09:31:25.7028547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7028723Z stats [('calls_captured', 12), 
('unique_graphs', 2)] 2025-12-04T09:31:25.7028920Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7029439Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7029862Z graph_break [] 2025-12-04T09:31:25.7029993Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7030166Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7030361Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7030846Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7031270Z graph_break [] 2025-12-04T09:31:25.7031404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7031577Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7031769Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7032250Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7032669Z graph_break [] 2025-12-04T09:31:25.7032801Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:31:25.7032973Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7033164Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7033647Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7034070Z graph_break [] 2025-12-04T09:31:25.7034199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7034370Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7034564Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7035068Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7035497Z graph_break [] 2025-12-04T09:31:25.7035625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7035792Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7036017Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7036494Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7036913Z graph_break [] 2025-12-04T09:31:25.7037052Z _______________ 
GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7037217Z Traceback (most recent call last): 2025-12-04T09:31:25.7037422Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7037655Z self.common( 2025-12-04T09:31:25.7037797Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7037966Z return func(*args, **kwds) 2025-12-04T09:31:25.7038166Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7038366Z check_model( 2025-12-04T09:31:25.7038536Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7038730Z assert_equal_fn( 2025-12-04T09:31:25.7038931Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7039168Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7039425Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7039696Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7039859Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7039950Z 2025-12-04T09:31:25.7039996Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7040175Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7040419Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7040562Z 2025-12-04T09:31:25.7040610Z The failure occurred for item [2] 2025-12-04T09:31:25.7040690Z 2025-12-04T09:31:25.7040764Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7041020Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7041207Z 2025-12-04T09:31:25.7041295Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7041494Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7041665Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7041859Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7042348Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7042767Z graph_break [] 2025-12-04T09:31:25.7042897Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7043068Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7043292Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7043772Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7044191Z graph_break [] 2025-12-04T09:31:25.7044319Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7044489Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7044679Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7045160Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7045577Z graph_break [] 2025-12-04T09:31:25.7045704Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7045900Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7046158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7046633Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7047054Z graph_break [] 2025-12-04T09:31:25.7047182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7047354Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7047547Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7048026Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7048446Z graph_break [] 2025-12-04T09:31:25.7048574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7048747Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7048937Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7049418Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7049839Z graph_break [] 2025-12-04T09:31:25.7049963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7050132Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7050323Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7050802Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7051222Z graph_break [] 2025-12-04T09:31:25.7051345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7051512Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7051745Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7052223Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7052646Z graph_break [] 2025-12-04T09:31:25.7052781Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7052947Z Traceback (most recent call last): 2025-12-04T09:31:25.7053151Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7053350Z self.common( 2025-12-04T09:31:25.7053494Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7053663Z return func(*args, **kwds) 2025-12-04T09:31:25.7053863Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7054062Z check_model( 2025-12-04T09:31:25.7054233Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7054457Z assert_equal_fn( 2025-12-04T09:31:25.7054655Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7054890Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7055142Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7055411Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7055571Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7055661Z 2025-12-04T09:31:25.7055706Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7055887Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7056180Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7056322Z 2025-12-04T09:31:25.7056368Z The failure occurred for item [2] 2025-12-04T09:31:25.7056452Z 2025-12-04T09:31:25.7056525Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7056781Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7056966Z 2025-12-04T09:31:25.7057055Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7057253Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7057423Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7057616Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7058102Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7058524Z graph_break [] 2025-12-04T09:31:25.7058651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7058820Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7059009Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7059490Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7059910Z graph_break [] 2025-12-04T09:31:25.7060070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7060242Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7060433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7060912Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7061333Z graph_break [] 2025-12-04T09:31:25.7061461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7061631Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7061829Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7062309Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7062755Z graph_break [] 2025-12-04T09:31:25.7062882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7063049Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7063236Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7063710Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7064125Z graph_break [] 2025-12-04T09:31:25.7064251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7064417Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7064605Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7065081Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7065492Z graph_break [] 2025-12-04T09:31:25.7065617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7065782Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7066018Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7066498Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7066917Z graph_break [] 2025-12-04T09:31:25.7067041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7067208Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7067395Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7067869Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7068285Z graph_break [] 2025-12-04T09:31:25.7068444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7068616Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7068805Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7069283Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7069699Z graph_break [] 2025-12-04T09:31:25.7069824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7069990Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7070179Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7070662Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7071112Z graph_break [] 2025-12-04T09:31:25.7071248Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7071411Z Traceback (most recent call last): 2025-12-04T09:31:25.7071615Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7071810Z self.common( 
2025-12-04T09:31:25.7071952Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7072117Z return func(*args, **kwds) 2025-12-04T09:31:25.7072313Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7072513Z check_model( 2025-12-04T09:31:25.7072684Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7072874Z assert_equal_fn( 2025-12-04T09:31:25.7073071Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7073307Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7073558Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7073824Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7073982Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7074068Z 2025-12-04T09:31:25.7074115Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7074293Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7074530Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7074667Z 2025-12-04T09:31:25.7074716Z The failure occurred for item [2] 2025-12-04T09:31:25.7074792Z 2025-12-04T09:31:25.7074866Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7075117Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7075305Z 2025-12-04T09:31:25.7075396Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7075593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7075760Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7076036Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7076552Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7076973Z graph_break [] 2025-12-04T09:31:25.7077102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7077277Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7077468Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7077944Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7078360Z graph_break [] 2025-12-04T09:31:25.7078489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7078662Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7078860Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7079336Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7079783Z graph_break [] 2025-12-04T09:31:25.7079908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7080078Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7080266Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7080748Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7081166Z graph_break [] 2025-12-04T09:31:25.7081296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7081470Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7081661Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7082140Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7082563Z graph_break [] 2025-12-04T09:31:25.7082689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7082858Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7083053Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7083534Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7083959Z graph_break [] 2025-12-04T09:31:25.7084089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7084259Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7084451Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7084960Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7085380Z graph_break [] 2025-12-04T09:31:25.7085511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7085685Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7085878Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7086461Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7086881Z graph_break [] 2025-12-04T09:31:25.7087012Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7087182Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7087379Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7087860Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7088313Z graph_break [] 2025-12-04T09:31:25.7088441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7088611Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7088805Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7089287Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7089710Z graph_break [] 2025-12-04T09:31:25.7089840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7090015Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7090208Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7090691Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7091116Z graph_break [] 2025-12-04T09:31:25.7091254Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7091427Z Traceback (most recent call last): 2025-12-04T09:31:25.7091639Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7091841Z self.common( 2025-12-04T09:31:25.7091990Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7092164Z return func(*args, **kwds) 2025-12-04T09:31:25.7092362Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7092564Z check_model( 2025-12-04T09:31:25.7092735Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7092929Z assert_equal_fn( 2025-12-04T09:31:25.7093129Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7093365Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7093618Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7093925Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7094088Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7094175Z 2025-12-04T09:31:25.7094227Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7094412Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7094654Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7094793Z 2025-12-04T09:31:25.7094844Z The failure occurred for item [2] 2025-12-04T09:31:25.7094923Z 2025-12-04T09:31:25.7095001Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7095257Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7095443Z 2025-12-04T09:31:25.7095538Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7095742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7095913Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7096149Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7096667Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7097084Z graph_break [] 2025-12-04T09:31:25.7097215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7097389Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7097583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7098069Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7098489Z graph_break [] 2025-12-04T09:31:25.7098619Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7098795Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7098990Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7099473Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7099892Z graph_break [] 2025-12-04T09:31:25.7100024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7100196Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7100391Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7100878Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7101298Z graph_break [] 2025-12-04T09:31:25.7101428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7101598Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7101793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7102311Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7102732Z graph_break [] 2025-12-04T09:31:25.7102860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7103030Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7103223Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7103699Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7104122Z graph_break [] 2025-12-04T09:31:25.7104256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7104428Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7104617Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7105134Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7105554Z graph_break [] 2025-12-04T09:31:25.7105683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7105852Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7106082Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7106569Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7106994Z graph_break [] 2025-12-04T09:31:25.7107124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7107300Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7107500Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7107989Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7108416Z graph_break [] 2025-12-04T09:31:25.7108553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7108730Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7108929Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7109422Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7109849Z graph_break [] 2025-12-04T09:31:25.7109987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7110164Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7110363Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7111310Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7111741Z graph_break [] 2025-12-04T09:31:25.7111878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7112057Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7112257Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7112743Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7113166Z graph_break [] 2025-12-04T09:31:25.7113315Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7113490Z Traceback (most recent call last): 2025-12-04T09:31:25.7113705Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7113952Z self.common( 2025-12-04T09:31:25.7114106Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7114284Z return func(*args, **kwds) 2025-12-04T09:31:25.7114490Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7114697Z check_model( 2025-12-04T09:31:25.7114879Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7115079Z assert_equal_fn( 2025-12-04T09:31:25.7115286Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 
2025-12-04T09:31:25.7115531Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7115795Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7116113Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7116283Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7116373Z 2025-12-04T09:31:25.7116426Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7116614Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7116862Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7117003Z 2025-12-04T09:31:25.7117059Z The failure occurred for item [2] 2025-12-04T09:31:25.7117139Z 2025-12-04T09:31:25.7117216Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7117476Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7117661Z 2025-12-04T09:31:25.7117761Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7117968Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7118144Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7118342Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7118835Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7119264Z graph_break [] 2025-12-04T09:31:25.7119403Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7119584Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7119824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7120313Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7120742Z graph_break [] 2025-12-04T09:31:25.7120880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7121053Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7121252Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7121735Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7122164Z graph_break [] 2025-12-04T09:31:25.7122301Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7122512Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7122712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7123203Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7123630Z graph_break [] 2025-12-04T09:31:25.7123769Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7123948Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7124150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7124638Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7125074Z graph_break [] 2025-12-04T09:31:25.7125211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7125386Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7125578Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7126171Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7126591Z graph_break [] 2025-12-04T09:31:25.7126721Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7126894Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7127086Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7127567Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7127986Z 
graph_break [] 2025-12-04T09:31:25.7128113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7128283Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7128508Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7128990Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7129413Z graph_break [] 2025-12-04T09:31:25.7129543Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7129718Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7129913Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7130398Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7130819Z graph_break [] 2025-12-04T09:31:25.7130951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7131122Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7131347Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7131834Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7132252Z graph_break [] 2025-12-04T09:31:25.7132382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7132553Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7132748Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7133227Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7133654Z graph_break [] 2025-12-04T09:31:25.7133785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7133958Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7134148Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7134629Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7135049Z graph_break [] 2025-12-04T09:31:25.7135178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7135348Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7135543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7136061Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7136480Z graph_break [] 2025-12-04T09:31:25.7136618Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7136787Z Traceback (most recent call last): 2025-12-04T09:31:25.7137036Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7137239Z self.common( 2025-12-04T09:31:25.7137386Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7137557Z return func(*args, **kwds) 2025-12-04T09:31:25.7137760Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7137960Z check_model( 2025-12-04T09:31:25.7138135Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7138330Z assert_equal_fn( 2025-12-04T09:31:25.7138532Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7138770Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7139027Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7139303Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7139467Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7139555Z 2025-12-04T09:31:25.7139605Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7139823Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7140070Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7140213Z 2025-12-04T09:31:25.7140267Z The failure occurred for item [2] 2025-12-04T09:31:25.7140347Z 2025-12-04T09:31:25.7140428Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7140696Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7140879Z 2025-12-04T09:31:25.7140977Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7141186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7141362Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7141563Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7142061Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7142490Z graph_break [] 2025-12-04T09:31:25.7142629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7142809Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7143012Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7143509Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7143941Z graph_break [] 2025-12-04T09:31:25.7144082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7144263Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7144464Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7144951Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7145379Z graph_break [] 2025-12-04T09:31:25.7145540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7145720Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7145962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7146454Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7146882Z graph_break [] 2025-12-04T09:31:25.7147020Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7147200Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7147401Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7147892Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7148356Z graph_break [] 2025-12-04T09:31:25.7148481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7148648Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7148837Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7149314Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7149728Z graph_break [] 2025-12-04T09:31:25.7149858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7150024Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7150213Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7150703Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7151119Z graph_break [] 2025-12-04T09:31:25.7151245Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7151412Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7151602Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7152081Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7152498Z graph_break [] 2025-12-04T09:31:25.7152629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7152801Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7152990Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7153469Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7153885Z graph_break [] 2025-12-04T09:31:25.7154012Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7154213Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7154404Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7154885Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7155303Z graph_break [] 2025-12-04T09:31:25.7155430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7155601Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7155790Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7156383Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7156799Z graph_break [] 2025-12-04T09:31:25.7156957Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7157127Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7157316Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7157794Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7158211Z graph_break [] 2025-12-04T09:31:25.7158337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7158504Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7158692Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7159169Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7159590Z graph_break [] 2025-12-04T09:31:25.7159717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7159887Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7160077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7160563Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7160980Z graph_break [] 2025-12-04T09:31:25.7161112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7161282Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7161474Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7161955Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7162375Z graph_break [] 2025-12-04T09:31:25.7162502Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7162701Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7162892Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7163374Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7163792Z graph_break [] 2025-12-04T09:31:25.7163928Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7164097Z Traceback (most recent call last): 2025-12-04T09:31:25.7164307Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 
5770, in test_var_mean 2025-12-04T09:31:25.7164517Z self.common( 2025-12-04T09:31:25.7164670Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7164848Z return func(*args, **kwds) 2025-12-04T09:31:25.7165056Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7165266Z check_model( 2025-12-04T09:31:25.7165476Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7165678Z assert_equal_fn( 2025-12-04T09:31:25.7165888Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7166178Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7166443Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7166722Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7166892Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7166987Z 2025-12-04T09:31:25.7167039Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7167230Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7167479Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7167632Z 2025-12-04T09:31:25.7167681Z The failure occurred for item [2] 2025-12-04T09:31:25.7167762Z 2025-12-04T09:31:25.7167836Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7168091Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7168273Z 2025-12-04T09:31:25.7168362Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7168571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7168748Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7168946Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7169431Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7169854Z graph_break [] 2025-12-04T09:31:25.7169986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7170158Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7170355Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7170872Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7171295Z graph_break [] 2025-12-04T09:31:25.7171427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7171602Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7171800Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7172282Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7172708Z graph_break [] 2025-12-04T09:31:25.7172847Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7173026Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7173229Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7173721Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7174180Z graph_break [] 2025-12-04T09:31:25.7174318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7174497Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7174697Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7175185Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7175615Z graph_break [] 2025-12-04T09:31:25.7175752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7175973Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7176177Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7176663Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7177088Z graph_break [] 2025-12-04T09:31:25.7177224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7177401Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7177605Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7178091Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7178520Z graph_break [] 2025-12-04T09:31:25.7178656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7178833Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7179033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7179524Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7179988Z graph_break [] 2025-12-04T09:31:25.7180129Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7180308Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7180511Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7180998Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7181425Z graph_break [] 2025-12-04T09:31:25.7181562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7181742Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7181946Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7182434Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7182895Z graph_break [] 2025-12-04T09:31:25.7183030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7183205Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7183401Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7183886Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7184316Z graph_break [] 2025-12-04T09:31:25.7184450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7184625Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7184826Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7185317Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7185747Z graph_break [] 2025-12-04T09:31:25.7185883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7186090Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7186288Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7186777Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7187212Z graph_break [] 2025-12-04T09:31:25.7187344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7187523Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7187723Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7188213Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7188677Z graph_break [] 2025-12-04T09:31:25.7188809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7188988Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7189191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7189681Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7190111Z graph_break [] 2025-12-04T09:31:25.7190248Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7190424Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7190619Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7191114Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7191573Z graph_break [] 2025-12-04T09:31:25.7191650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7191717Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7191818Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7192172Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7192212Z graph_break [] 2025-12-04T09:31:25.7192296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7192358Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7192467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7192816Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7192862Z graph_break [] 2025-12-04T09:31:25.7192938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7193004Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7193104Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7193460Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7193505Z graph_break [] 2025-12-04T09:31:25.7193586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7193644Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7193749Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7194103Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7194144Z graph_break [] 2025-12-04T09:31:25.7194257Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7194308Z Traceback (most recent call last): 2025-12-04T09:31:25.7194443Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7194486Z self.common( 2025-12-04T09:31:25.7194585Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7194632Z return func(*args, **kwds) 2025-12-04T09:31:25.7194767Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7194808Z check_model( 2025-12-04T09:31:25.7194933Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7194975Z assert_equal_fn( 2025-12-04T09:31:25.7195122Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7195187Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7195355Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7195456Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7195518Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7195520Z 2025-12-04T09:31:25.7195568Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7195679Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7195784Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7195786Z 2025-12-04T09:31:25.7195839Z The failure occurred for item [2] 2025-12-04T09:31:25.7195842Z 2025-12-04T09:31:25.7195919Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7196163Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7196165Z 2025-12-04T09:31:25.7196258Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7196338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7208022Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7208174Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7208528Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7208573Z graph_break [] 2025-12-04T09:31:25.7208656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7208723Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7208831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7209183Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7209228Z graph_break [] 2025-12-04T09:31:25.7209304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7209369Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7209468Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7209884Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7209925Z graph_break [] 2025-12-04T09:31:25.7210008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7210067Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7210167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7210510Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7210553Z graph_break [] 2025-12-04T09:31:25.7210626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7210689Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7210785Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7211133Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7211206Z graph_break [] 2025-12-04T09:31:25.7211279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7211335Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7211430Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7211782Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7211819Z graph_break [] 2025-12-04T09:31:25.7211895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7211955Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7212052Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7212398Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7212439Z graph_break [] 2025-12-04T09:31:25.7212511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7212569Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7212664Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7213013Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7213052Z graph_break [] 2025-12-04T09:31:25.7213127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7213185Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7213282Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7213659Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7213696Z graph_break [] 2025-12-04T09:31:25.7213771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7213828Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7213926Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7214274Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7214315Z graph_break [] 2025-12-04T09:31:25.7214389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7214446Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7214542Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7214885Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7214947Z graph_break [] 2025-12-04T09:31:25.7215022Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7215076Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7215173Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7215521Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7215570Z graph_break [] 2025-12-04T09:31:25.7215644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7215702Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7215798Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7216194Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7216233Z graph_break [] 2025-12-04T09:31:25.7216307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7216367Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7216464Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7216814Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7216853Z graph_break [] 2025-12-04T09:31:25.7216929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7216985Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7217082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7217454Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7217495Z graph_break [] 2025-12-04T09:31:25.7217567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7217629Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7217724Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7218074Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7218112Z graph_break [] 2025-12-04T09:31:25.7218185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7218242Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7218338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7218689Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7218760Z graph_break [] 2025-12-04T09:31:25.7218836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7218892Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7218988Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7219334Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7219374Z graph_break [] 2025-12-04T09:31:25.7219446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7219508Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7219603Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7219949Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7219985Z graph_break [] 2025-12-04T09:31:25.7220062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7220119Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7220220Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7220567Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7220607Z graph_break [] 2025-12-04T09:31:25.7220682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7220739Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7220834Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7221201Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7221240Z graph_break [] 2025-12-04T09:31:25.7221312Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7221372Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7221467Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7221813Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7221850Z graph_break [] 2025-12-04T09:31:25.7221934Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7221982Z Traceback (most recent call last): 
2025-12-04T09:31:25.7222118Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7222156Z self.common( 2025-12-04T09:31:25.7222250Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7222317Z return func(*args, **kwds) 2025-12-04T09:31:25.7222446Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7222485Z check_model( 2025-12-04T09:31:25.7222604Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7222642Z assert_equal_fn( 2025-12-04T09:31:25.7222786Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7222848Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7223012Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7223086Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7223141Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7223144Z 2025-12-04T09:31:25.7223192Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7223297Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7223399Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7223401Z 2025-12-04T09:31:25.7223450Z The failure occurred for item [2] 2025-12-04T09:31:25.7223452Z 2025-12-04T09:31:25.7223525Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7223677Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7223680Z 2025-12-04T09:31:25.7223771Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7223848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7223905Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7224003Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7224349Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7224386Z graph_break [] 2025-12-04T09:31:25.7224461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7224518Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7224615Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7224991Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7225032Z graph_break [] 2025-12-04T09:31:25.7225103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7225162Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7225257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7225603Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7225639Z graph_break [] 2025-12-04T09:31:25.7225714Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7225771Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7225868Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7226312Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7226348Z graph_break [] 2025-12-04T09:31:25.7226422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7226478Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7226575Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7226923Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7226964Z graph_break [] 2025-12-04T09:31:25.7227036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7227093Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7227188Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7227530Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7227566Z graph_break [] 2025-12-04T09:31:25.7227642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7227695Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7227793Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7228138Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7228176Z graph_break [] 2025-12-04T09:31:25.7228248Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7228305Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7228401Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7228777Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7228818Z graph_break [] 2025-12-04T09:31:25.7228890Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7228949Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7229044Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7229389Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7229425Z graph_break [] 2025-12-04T09:31:25.7229505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7229562Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7229661Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7230031Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7230069Z graph_break [] 2025-12-04T09:31:25.7230141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7230199Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7230295Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7230641Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7230679Z graph_break [] 2025-12-04T09:31:25.7230752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7230806Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7230904Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7231249Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7231286Z graph_break [] 2025-12-04T09:31:25.7231361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7231416Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7231512Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7231856Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7231894Z graph_break [] 2025-12-04T09:31:25.7231966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7232025Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7232120Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7232491Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7232530Z graph_break [] 2025-12-04T09:31:25.7232604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7232662Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7232760Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7233106Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7233145Z graph_break [] 2025-12-04T09:31:25.7233221Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7233274Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7233372Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7233738Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7233778Z graph_break [] 2025-12-04T09:31:25.7233850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7233910Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7234006Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7234354Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7234392Z graph_break [] 2025-12-04T09:31:25.7234467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7234523Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7234620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7234964Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7235002Z graph_break [] 2025-12-04T09:31:25.7235076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7235134Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7235228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7235576Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7235614Z graph_break [] 2025-12-04T09:31:25.7235687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7235744Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7235839Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7236302Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7236341Z graph_break [] 2025-12-04T09:31:25.7236415Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7236472Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7236569Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7236911Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7236950Z graph_break [] 2025-12-04T09:31:25.7237024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7237080Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7237176Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7237548Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7237584Z graph_break [] 2025-12-04T09:31:25.7237659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7237712Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7237811Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7238157Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7238194Z graph_break [] 2025-12-04T09:31:25.7238278Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7238324Z Traceback (most recent call last): 2025-12-04T09:31:25.7238453Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7238490Z self.common( 2025-12-04T09:31:25.7238581Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7238625Z return func(*args, **kwds) 2025-12-04T09:31:25.7238754Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.7238791Z check_model( 2025-12-04T09:31:25.7238910Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7238949Z assert_equal_fn( 2025-12-04T09:31:25.7239093Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7239155Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7239318Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7239391Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7239446Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7239448Z 2025-12-04T09:31:25.7239493Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7239591Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7239683Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7239685Z 2025-12-04T09:31:25.7239793Z The failure occurred for item [2] 2025-12-04T09:31:25.7239795Z 2025-12-04T09:31:25.7239868Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7240019Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7240023Z 2025-12-04T09:31:25.7240111Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7240186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7240240Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7240339Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7240684Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7240722Z graph_break [] 2025-12-04T09:31:25.7240796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7240884Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7240981Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7241326Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7241364Z graph_break [] 2025-12-04T09:31:25.7241436Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7241494Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7241591Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7241939Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7241977Z graph_break [] 2025-12-04T09:31:25.7242049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7242105Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7242202Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7242547Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7242586Z graph_break [] 2025-12-04T09:31:25.7242658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7242717Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7242812Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7243157Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7243196Z graph_break [] 2025-12-04T09:31:25.7243269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7243327Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7243446Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7243795Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7243833Z graph_break [] 2025-12-04T09:31:25.7243908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7243964Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7244060Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7244404Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7244443Z graph_break [] 2025-12-04T09:31:25.7244514Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7244570Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7244688Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7245034Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7245070Z graph_break [] 2025-12-04T09:31:25.7245144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7245201Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7245298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7245644Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7245683Z graph_break [] 2025-12-04T09:31:25.7245757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7245810Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7245908Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7246291Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7246329Z graph_break [] 2025-12-04T09:31:25.7246400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7246457Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7246555Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7246899Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7246935Z graph_break [] 2025-12-04T09:31:25.7247008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7247062Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7247194Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7247537Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7247578Z graph_break [] 2025-12-04T09:31:25.7247650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7247704Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7247800Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7248148Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7248184Z graph_break [] 2025-12-04T09:31:25.7248258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7248314Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7248440Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7248785Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7248822Z graph_break [] 2025-12-04T09:31:25.7248893Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7248954Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7249050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7249395Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7249435Z graph_break [] 2025-12-04T09:31:25.7249507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7249562Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7249657Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7250005Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7250043Z graph_break [] 2025-12-04T09:31:25.7250119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7250175Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7250273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7250617Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7250657Z graph_break [] 2025-12-04T09:31:25.7250729Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7250786Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7250903Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7251249Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7251287Z graph_break [] 2025-12-04T09:31:25.7251361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7251417Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7251513Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7251857Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7251895Z graph_break [] 2025-12-04T09:31:25.7251971Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7252025Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7252144Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7252490Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7252528Z graph_break [] 2025-12-04T09:31:25.7252600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7252657Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7252754Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7253099Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7253136Z graph_break [] 2025-12-04T09:31:25.7253210Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7253264Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7253360Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7253700Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7253740Z graph_break [] 2025-12-04T09:31:25.7253812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7253866Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7253961Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7254308Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7254345Z graph_break [] 2025-12-04T09:31:25.7254416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7254474Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7254568Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7254943Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7254982Z graph_break [] 2025-12-04T09:31:25.7255065Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7255110Z Traceback (most recent call last): 2025-12-04T09:31:25.7255237Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7255272Z self.common( 2025-12-04T09:31:25.7255363Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7255406Z return func(*args, **kwds) 2025-12-04T09:31:25.7255533Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7255570Z check_model( 2025-12-04T09:31:25.7255687Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7255725Z 
assert_equal_fn( 2025-12-04T09:31:25.7255886Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7255987Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7256149Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7256220Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7256273Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7256275Z 2025-12-04T09:31:25.7256320Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7256423Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7256526Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7256528Z 2025-12-04T09:31:25.7256575Z The failure occurred for item [2] 2025-12-04T09:31:25.7256577Z 2025-12-04T09:31:25.7256649Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7256798Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7256800Z 2025-12-04T09:31:25.7256887Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7256960Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7257016Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7257113Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7257463Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7257499Z graph_break [] 2025-12-04T09:31:25.7257573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7257630Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7257726Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7258076Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7258116Z graph_break [] 2025-12-04T09:31:25.7258188Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7258277Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7258373Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7258719Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7258756Z graph_break [] 2025-12-04T09:31:25.7258829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7258886Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7258982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7259331Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), 
('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7259367Z graph_break [] 2025-12-04T09:31:25.7259466Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7259522Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7259618Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7259961Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7260001Z graph_break [] 2025-12-04T09:31:25.7260074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7260131Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7260227Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7260573Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7260611Z graph_break [] 2025-12-04T09:31:25.7260685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7260738Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7260833Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7261177Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7261215Z graph_break [] 2025-12-04T09:31:25.7261288Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7261343Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7261437Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7261780Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7261818Z graph_break [] 2025-12-04T09:31:25.7261891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7261969Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7262064Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7262411Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7262449Z graph_break [] 2025-12-04T09:31:25.7262523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7262576Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7262673Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7263018Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7263055Z graph_break [] 2025-12-04T09:31:25.7263127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7263216Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7263310Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7263656Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7263691Z graph_break [] 2025-12-04T09:31:25.7263765Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7263821Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7263917Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7264259Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7264296Z graph_break [] 2025-12-04T09:31:25.7264369Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7264422Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7264518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7264863Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7264900Z graph_break [] 2025-12-04T09:31:25.7264971Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7265029Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7265123Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7265468Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7265505Z graph_break [] 2025-12-04T09:31:25.7265577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7265656Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7265754Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7266159Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7266200Z graph_break [] 2025-12-04T09:31:25.7266273Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7266326Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7266423Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7266768Z 
inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7266806Z graph_break [] 2025-12-04T09:31:25.7266878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7266963Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7267057Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7267402Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7267440Z graph_break [] 2025-12-04T09:31:25.7267512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7267570Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7267666Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7268014Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7268054Z graph_break [] 2025-12-04T09:31:25.7268127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7268185Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7268280Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 
2), ('ok', 2)] 2025-12-04T09:31:25.7268627Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7268664Z graph_break [] 2025-12-04T09:31:25.7268737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7268794Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7268890Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7269236Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7269273Z graph_break [] 2025-12-04T09:31:25.7269345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7269435Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7269532Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7269881Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7269920Z graph_break [] 2025-12-04T09:31:25.7269992Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7270047Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7270142Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7270489Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7270525Z graph_break [] 2025-12-04T09:31:25.7270600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7270676Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7270773Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7271117Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7271153Z graph_break [] 2025-12-04T09:31:25.7271227Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7271282Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7271381Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7271730Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7271770Z graph_break [] 2025-12-04T09:31:25.7271843Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7271897Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7271992Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7272338Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7272373Z graph_break [] 2025-12-04T09:31:25.7272456Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7272502Z Traceback (most recent call last): 2025-12-04T09:31:25.7272629Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7272664Z self.common( 2025-12-04T09:31:25.7272754Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7272797Z return func(*args, **kwds) 2025-12-04T09:31:25.7272922Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7272959Z check_model( 2025-12-04T09:31:25.7273075Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7273113Z assert_equal_fn( 2025-12-04T09:31:25.7273276Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7273335Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7273496Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7273569Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7273622Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7273625Z 2025-12-04T09:31:25.7273669Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7273771Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7273873Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7273877Z 2025-12-04T09:31:25.7273924Z The failure occurred for item [2] 2025-12-04T09:31:25.7273926Z 2025-12-04T09:31:25.7274005Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7274151Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7274177Z 2025-12-04T09:31:25.7274265Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7274337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7274393Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7274490Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7274838Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7274875Z graph_break [] 2025-12-04T09:31:25.7274949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7275006Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7275103Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7275449Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7275486Z graph_break [] 2025-12-04T09:31:25.7275560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7275617Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7275712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7276111Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7276151Z graph_break [] 2025-12-04T09:31:25.7276222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7276283Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7276377Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7276723Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7276758Z graph_break [] 2025-12-04T09:31:25.7276861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7276917Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7277015Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7277360Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7277400Z graph_break [] 2025-12-04T09:31:25.7277472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7277529Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7277623Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7277974Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7278041Z graph_break [] 2025-12-04T09:31:25.7278114Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7278169Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7278266Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7278611Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7278647Z graph_break [] 2025-12-04T09:31:25.7278721Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7278775Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7278871Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7279215Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7279252Z graph_break [] 2025-12-04T09:31:25.7279323Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7279379Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7279472Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7279820Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7279858Z graph_break [] 2025-12-04T09:31:25.7279933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7279986Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7280084Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7280428Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7280467Z graph_break [] 2025-12-04T09:31:25.7280561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7280617Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7280711Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7281058Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7281095Z graph_break [] 2025-12-04T09:31:25.7281166Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7281220Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7281315Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7281660Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7281726Z graph_break [] 2025-12-04T09:31:25.7281799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7281853Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7281950Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7282295Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7282332Z graph_break [] 2025-12-04T09:31:25.7282406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7282464Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7282558Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7282908Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7282944Z graph_break [] 2025-12-04T09:31:25.7283017Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7283072Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7283169Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7283516Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7283553Z graph_break [] 2025-12-04T09:31:25.7283629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7283682Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7283779Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7284123Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7284160Z graph_break [] 2025-12-04T09:31:25.7284254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7284313Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7284407Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7284756Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7284793Z graph_break [] 2025-12-04T09:31:25.7284867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7284923Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7285022Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7285368Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7285430Z graph_break [] 2025-12-04T09:31:25.7285503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7285558Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7285654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7286045Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7286083Z graph_break [] 2025-12-04T09:31:25.7286157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7286211Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7286307Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7286658Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7286693Z graph_break [] 2025-12-04T09:31:25.7286765Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7286820Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7286916Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7287262Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7287301Z graph_break [] 2025-12-04T09:31:25.7287373Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7287428Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7287524Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7287872Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7287909Z graph_break [] 2025-12-04T09:31:25.7288011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7288068Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7288163Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7288512Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7288548Z graph_break [] 2025-12-04T09:31:25.7288622Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7288678Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7288776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7289122Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7289189Z graph_break [] 2025-12-04T09:31:25.7289262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7289316Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7289411Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7289756Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7289792Z graph_break [] 2025-12-04T09:31:25.7289867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7289920Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7290018Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7290369Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7290404Z graph_break [] 2025-12-04T09:31:25.7290487Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7290533Z Traceback (most recent call last): 2025-12-04T09:31:25.7290661Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7290698Z self.common( 2025-12-04T09:31:25.7290787Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7290830Z return func(*args, **kwds) 2025-12-04T09:31:25.7290956Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.7290994Z check_model( 2025-12-04T09:31:25.7291110Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7291147Z assert_equal_fn( 2025-12-04T09:31:25.7291284Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7291342Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7291502Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7291574Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7291627Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7291629Z 2025-12-04T09:31:25.7291694Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7291790Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7291884Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7291887Z 2025-12-04T09:31:25.7291933Z The failure occurred for item [2] 2025-12-04T09:31:25.7291935Z 2025-12-04T09:31:25.7292008Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7292156Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7292158Z 2025-12-04T09:31:25.7292246Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7292320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7292375Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7292476Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7292826Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7292889Z graph_break [] 2025-12-04T09:31:25.7292963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7293020Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7293118Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7293465Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7293505Z graph_break [] 2025-12-04T09:31:25.7293577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7293636Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7293732Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7294078Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7294115Z graph_break [] 2025-12-04T09:31:25.7294188Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7294243Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7294343Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7294691Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7294730Z graph_break [] 2025-12-04T09:31:25.7294802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7294859Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7294954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7295320Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7295361Z graph_break [] 2025-12-04T09:31:25.7295433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7295491Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7295586Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7295982Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7296018Z graph_break [] 2025-12-04T09:31:25.7296093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7296147Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7296246Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7296590Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7296661Z graph_break [] 2025-12-04T09:31:25.7296733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7296788Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7296883Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7297227Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7297263Z graph_break [] 2025-12-04T09:31:25.7297336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7297392Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7297490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7297835Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7297871Z graph_break [] 2025-12-04T09:31:25.7297944Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7297998Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7298097Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7298442Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7298481Z graph_break [] 2025-12-04T09:31:25.7298552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7298607Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7298701Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7299082Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7299119Z graph_break [] 2025-12-04T09:31:25.7299193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7299247Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7299346Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7299687Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7299724Z graph_break [] 2025-12-04T09:31:25.7299796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7299850Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7299947Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7300291Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7300353Z graph_break [] 2025-12-04T09:31:25.7300425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7300483Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7300578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7300925Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7300961Z graph_break [] 2025-12-04T09:31:25.7301035Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7301092Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7301191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7301534Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7301572Z graph_break [] 2025-12-04T09:31:25.7301643Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7301698Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7301796Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7302140Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7302178Z graph_break [] 2025-12-04T09:31:25.7302252Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7302308Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7302405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7302751Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7302807Z graph_break [] 2025-12-04T09:31:25.7302883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7302939Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7303038Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7303389Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7303426Z graph_break [] 2025-12-04T09:31:25.7303498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7303557Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7303655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7304001Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7304061Z graph_break [] 2025-12-04T09:31:25.7304135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7304189Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7304286Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7304629Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7304669Z graph_break [] 2025-12-04T09:31:25.7304742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7304798Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7304896Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7305242Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7305279Z graph_break [] 2025-12-04T09:31:25.7305351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7305405Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7305501Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7305846Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7305884Z graph_break [] 2025-12-04T09:31:25.7306003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7306056Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7306153Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7306496Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7306562Z graph_break [] 2025-12-04T09:31:25.7306634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7306692Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7306788Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7307132Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7307169Z graph_break [] 2025-12-04T09:31:25.7307241Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7307297Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7307393Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7307740Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7307803Z graph_break [] 2025-12-04T09:31:25.7307875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7307928Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7308023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7308366Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7308405Z graph_break [] 2025-12-04T09:31:25.7308476Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7308534Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7308630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7308976Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7309011Z graph_break [] 2025-12-04T09:31:25.7309094Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7309139Z Traceback (most recent call last): 2025-12-04T09:31:25.7309266Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7309304Z self.common( 2025-12-04T09:31:25.7309394Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7309436Z return func(*args, **kwds) 2025-12-04T09:31:25.7309563Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.7309601Z check_model( 2025-12-04T09:31:25.7309717Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in 
check_model 2025-12-04T09:31:25.7309754Z assert_equal_fn( 2025-12-04T09:31:25.7309894Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7309953Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7310113Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7310186Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7310267Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7310269Z 2025-12-04T09:31:25.7310315Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7310409Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7310503Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7310505Z 2025-12-04T09:31:25.7310549Z The failure occurred for item [2] 2025-12-04T09:31:25.7310551Z 2025-12-04T09:31:25.7310625Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7310770Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7310772Z 2025-12-04T09:31:25.7310860Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7310932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7310992Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7311089Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7311434Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7311497Z graph_break [] 2025-12-04T09:31:25.7311571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7311627Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7311724Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7312070Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7312108Z graph_break [] 2025-12-04T09:31:25.7312183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7312240Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7312336Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7312681Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7312719Z graph_break [] 2025-12-04T09:31:25.7312790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7312848Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7312942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7313285Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), 
('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7313323Z graph_break [] 2025-12-04T09:31:25.7313397Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7313451Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7313550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7313917Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7313955Z graph_break [] 2025-12-04T09:31:25.7314027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7314084Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7314179Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7314526Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7314564Z graph_break [] 2025-12-04T09:31:25.7314636Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7314692Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7314787Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7315131Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7315192Z graph_break [] 2025-12-04T09:31:25.7315266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7315319Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7315415Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7315762Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7315799Z graph_break [] 2025-12-04T09:31:25.7315871Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7316019Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7316113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7316461Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7316497Z graph_break [] 2025-12-04T09:31:25.7316570Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7316625Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7316722Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7317067Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7317106Z graph_break [] 2025-12-04T09:31:25.7317179Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7317232Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7317328Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7317698Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7317739Z graph_break [] 2025-12-04T09:31:25.7317812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7317871Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7317967Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7318312Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7318349Z graph_break [] 2025-12-04T09:31:25.7318422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7318477Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7318575Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7318918Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7318984Z graph_break [] 2025-12-04T09:31:25.7319056Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7319115Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7319211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7319560Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7319599Z graph_break [] 2025-12-04T09:31:25.7319672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7319731Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7319826Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7320177Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7320213Z graph_break [] 2025-12-04T09:31:25.7320288Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7320341Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7320440Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7320784Z 
inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7320827Z graph_break [] 2025-12-04T09:31:25.7320899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7320957Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7321052Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7321417Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7321454Z graph_break [] 2025-12-04T09:31:25.7321528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7321585Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7321682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7322028Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7322065Z graph_break [] 2025-12-04T09:31:25.7322140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7322196Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7322294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 
2), ('ok', 2)] 2025-12-04T09:31:25.7322638Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7322698Z graph_break [] 2025-12-04T09:31:25.7322770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7322825Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7322922Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7323268Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7323305Z graph_break [] 2025-12-04T09:31:25.7323381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7323438Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7323537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7323881Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7323920Z graph_break [] 2025-12-04T09:31:25.7323993Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7324047Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7324146Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7324489Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7324528Z graph_break [] 2025-12-04T09:31:25.7324599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7324654Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7324749Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7325128Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7325164Z graph_break [] 2025-12-04T09:31:25.7325238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7325296Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7325392Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7325736Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7325773Z graph_break [] 2025-12-04T09:31:25.7325845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7325900Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7326041Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7326390Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7326461Z graph_break [] 2025-12-04T09:31:25.7326533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7326589Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7326685Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7327032Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7327068Z graph_break [] 2025-12-04T09:31:25.7327142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7327201Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7327297Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7327641Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7327678Z graph_break [] 2025-12-04T09:31:25.7327751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7327808Z stats 
[('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7327905Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7328251Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7328290Z graph_break [] 2025-12-04T09:31:25.7328372Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7328416Z Traceback (most recent call last): 2025-12-04T09:31:25.7328545Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7328581Z self.common( 2025-12-04T09:31:25.7328673Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7328717Z return func(*args, **kwds) 2025-12-04T09:31:25.7328871Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7328907Z check_model( 2025-12-04T09:31:25.7329024Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7329063Z assert_equal_fn( 2025-12-04T09:31:25.7329202Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7329262Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7329423Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7329495Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7329547Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7329549Z 2025-12-04T09:31:25.7329595Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7329698Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7329801Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7329803Z 2025-12-04T09:31:25.7329847Z The failure occurred for item [2] 2025-12-04T09:31:25.7329874Z 2025-12-04T09:31:25.7329949Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7330095Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7330097Z 2025-12-04T09:31:25.7330185Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7330258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7330315Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7330412Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7330763Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7330801Z graph_break [] 2025-12-04T09:31:25.7330875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7330933Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7331030Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7331375Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7331413Z graph_break [] 2025-12-04T09:31:25.7331488Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7331544Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7331642Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7331991Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7332030Z graph_break [] 2025-12-04T09:31:25.7332103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7332162Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7332257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7332622Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7332661Z graph_break [] 2025-12-04T09:31:25.7332735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7332791Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7332888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7333231Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7333268Z graph_break [] 2025-12-04T09:31:25.7333343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7333399Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7333494Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7333861Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7333900Z graph_break [] 2025-12-04T09:31:25.7333972Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7334028Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7334123Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7334469Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7334507Z graph_break [] 2025-12-04T09:31:25.7334580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7334634Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7334730Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7335074Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7335111Z graph_break [] 2025-12-04T09:31:25.7335185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7335242Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7335336Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7335685Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7335721Z graph_break [] 2025-12-04T09:31:25.7335794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7335848Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7336016Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7336389Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7336427Z graph_break [] 2025-12-04T09:31:25.7336499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7336554Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7336650Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7336997Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7337034Z graph_break [] 2025-12-04T09:31:25.7337109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7337164Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7337258Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7337640Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7337677Z graph_break [] 2025-12-04T09:31:25.7337750Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7337803Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7337899Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7338246Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7338284Z graph_break [] 2025-12-04T09:31:25.7338356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7338413Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7338507Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7338857Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7338895Z graph_break [] 2025-12-04T09:31:25.7338969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7339026Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7339121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7339473Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7339509Z graph_break [] 2025-12-04T09:31:25.7339582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7339636Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7339732Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7340097Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7340137Z graph_break [] 2025-12-04T09:31:25.7340208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7340268Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7340362Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7340711Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7340747Z graph_break [] 2025-12-04T09:31:25.7340821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7340881Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7340978Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7341352Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7341388Z graph_break [] 2025-12-04T09:31:25.7341461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7341517Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7341614Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7341957Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7341997Z graph_break [] 2025-12-04T09:31:25.7342068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7342122Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7342217Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7342565Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7342600Z graph_break [] 2025-12-04T09:31:25.7342675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7342730Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7342829Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7343176Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7343214Z graph_break [] 2025-12-04T09:31:25.7343287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7343341Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7343436Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7343801Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7343841Z graph_break [] 2025-12-04T09:31:25.7343914Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7343969Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7344064Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7344412Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7344449Z graph_break [] 2025-12-04T09:31:25.7344525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7344583Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7344684Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7345053Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7345092Z graph_break [] 2025-12-04T09:31:25.7345166Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7345224Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7345321Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7345675Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7345713Z graph_break [] 2025-12-04T09:31:25.7345785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7345841Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7345978Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7346328Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7346365Z graph_break [] 2025-12-04T09:31:25.7346441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7346499Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7346600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7346949Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7346989Z graph_break [] 2025-12-04T09:31:25.7347064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7347122Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7347218Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7347600Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7347638Z graph_break [] 2025-12-04T09:31:25.7347717Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7347772Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7347870Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7348218Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7348257Z graph_break [] 2025-12-04T09:31:25.7348343Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7348391Z Traceback (most recent call last): 2025-12-04T09:31:25.7348520Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7348558Z self.common( 2025-12-04T09:31:25.7348678Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7348723Z return func(*args, **kwds) 2025-12-04T09:31:25.7348854Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7348892Z check_model( 2025-12-04T09:31:25.7349011Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7349049Z assert_equal_fn( 2025-12-04T09:31:25.7349189Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7349250Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7349414Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7349489Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7349549Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7349551Z 2025-12-04T09:31:25.7349597Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7349702Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7349806Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7349808Z 2025-12-04T09:31:25.7349853Z The failure occurred for item [2] 2025-12-04T09:31:25.7349855Z 2025-12-04T09:31:25.7349930Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7350081Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7350083Z 2025-12-04T09:31:25.7350172Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7350248Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7350303Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7350406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7350751Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7350793Z graph_break [] 2025-12-04T09:31:25.7350868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7350925Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7351045Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7351392Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7351437Z graph_break [] 2025-12-04T09:31:25.7351511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7351572Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7351668Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7352019Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7352056Z graph_break [] 2025-12-04T09:31:25.7352131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7352188Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7352319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7352665Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7352705Z graph_break [] 2025-12-04T09:31:25.7352778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7352837Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7352934Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7353284Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7353330Z graph_break [] 2025-12-04T09:31:25.7353404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7353461Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7353558Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7353912Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7353948Z graph_break [] 2025-12-04T09:31:25.7354024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7354081Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7354181Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7354523Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7354563Z graph_break [] 2025-12-04T09:31:25.7354635Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7354691Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7354810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7355154Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7355193Z graph_break [] 2025-12-04T09:31:25.7355269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7355328Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7355426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7355774Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7355813Z graph_break [] 2025-12-04T09:31:25.7355888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7355988Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7356117Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7356462Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7356501Z graph_break [] 2025-12-04T09:31:25.7356575Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7356634Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7356731Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7357079Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7357118Z graph_break [] 2025-12-04T09:31:25.7357193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7357247Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7357346Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7357695Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7357734Z graph_break [] 2025-12-04T09:31:25.7357809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7357865Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7357963Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7358311Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7358352Z graph_break [] 2025-12-04T09:31:25.7358424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7358485Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7358581Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7358961Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7359003Z graph_break [] 2025-12-04T09:31:25.7359079Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7359136Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7359234Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7359578Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7359621Z graph_break [] 2025-12-04T09:31:25.7359694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7359750Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7359871Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7360223Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7360260Z graph_break [] 2025-12-04T09:31:25.7360335Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7360396Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7360492Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7360843Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7360883Z graph_break [] 2025-12-04T09:31:25.7360959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7361016Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7361115Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7361458Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7361501Z graph_break [] 2025-12-04T09:31:25.7361574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7361635Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7361733Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7362081Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7362119Z graph_break [] 2025-12-04T09:31:25.7362197Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7362252Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7362352Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7362719Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7362761Z graph_break [] 2025-12-04T09:31:25.7362837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7362893Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7362993Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7363342Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7363386Z graph_break [] 2025-12-04T09:31:25.7363459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7363517Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7363638Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7363987Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7364023Z graph_break [] 2025-12-04T09:31:25.7364100Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7364156Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7364256Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7364604Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7364644Z graph_break [] 2025-12-04T09:31:25.7364717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7364777Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7364874Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7365219Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7365260Z graph_break [] 2025-12-04T09:31:25.7365334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7365391Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7365486Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7365834Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7365872Z graph_break [] 2025-12-04T09:31:25.7366004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7366060Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7366160Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7366541Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7366583Z graph_break [] 2025-12-04T09:31:25.7366656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7366716Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7366812Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7367161Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7367199Z graph_break [] 2025-12-04T09:31:25.7367277Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7367335Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7367437Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7367821Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7367861Z graph_break [] 2025-12-04T09:31:25.7367941Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7368000Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7368099Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7370837Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7370889Z graph_break [] 2025-12-04T09:31:25.7370967Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7371023Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7371125Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7371472Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7371514Z graph_break [] 2025-12-04T09:31:25.7371601Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7371651Z Traceback (most recent call last): 2025-12-04T09:31:25.7371783Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7371828Z self.common( 2025-12-04T09:31:25.7371920Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7371968Z return func(*args, **kwds) 2025-12-04T09:31:25.7372097Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7372138Z check_model( 2025-12-04T09:31:25.7372254Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7372295Z assert_equal_fn( 2025-12-04T09:31:25.7372438Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7372537Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7372700Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7372779Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7372833Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7372836Z 2025-12-04T09:31:25.7372883Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7372985Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7373089Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7373091Z 2025-12-04T09:31:25.7373136Z The failure occurred for item [2] 2025-12-04T09:31:25.7373139Z 2025-12-04T09:31:25.7373215Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7373369Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7373373Z 2025-12-04T09:31:25.7373462Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7373539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7373619Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7373718Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7374068Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), 
('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7374106Z graph_break [] 2025-12-04T09:31:25.7374178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7374238Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7374333Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7374684Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7374723Z graph_break [] 2025-12-04T09:31:25.7374799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7374857Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7374955Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7375301Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7375339Z graph_break [] 2025-12-04T09:31:25.7375411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7375472Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7375567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7375912Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7376024Z graph_break [] 2025-12-04T09:31:25.7376096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7376183Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7376280Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7376629Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7376667Z graph_break [] 2025-12-04T09:31:25.7376740Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7376794Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7376892Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7377238Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7377275Z graph_break [] 2025-12-04T09:31:25.7377348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7377433Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7377528Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7377878Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7377914Z graph_break [] 2025-12-04T09:31:25.7377988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7378042Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7378143Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7378490Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7378529Z graph_break [] 2025-12-04T09:31:25.7378603Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7378659Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7378756Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7379103Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7379140Z graph_break [] 2025-12-04T09:31:25.7379213Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7379271Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7379367Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7379716Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7379751Z graph_break [] 2025-12-04T09:31:25.7379826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7379879Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7379997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7380340Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7380381Z graph_break [] 2025-12-04T09:31:25.7380454Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7380509Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7380605Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7380949Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7380987Z graph_break [] 2025-12-04T09:31:25.7381059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7381140Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7381235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:31:25.7381581Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7381617Z graph_break [] 2025-12-04T09:31:25.7381692Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7381749Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7381851Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7382197Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7382237Z graph_break [] 2025-12-04T09:31:25.7382309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7382367Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7382462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7382809Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7382847Z graph_break [] 2025-12-04T09:31:25.7382918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7382975Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7383071Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7383414Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7383450Z graph_break [] 2025-12-04T09:31:25.7383523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7383578Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7383707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7384056Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7384095Z graph_break [] 2025-12-04T09:31:25.7384167Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7384226Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7384322Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7384673Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7384709Z graph_break [] 2025-12-04T09:31:25.7384782Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7384860Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7384956Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7385300Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7385337Z graph_break [] 2025-12-04T09:31:25.7385409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7385466Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7385563Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7385912Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7385996Z graph_break [] 2025-12-04T09:31:25.7386069Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7386127Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7386222Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7386568Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7386605Z graph_break [] 2025-12-04T09:31:25.7386678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7386733Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7386831Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7387176Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7387213Z graph_break [] 2025-12-04T09:31:25.7387286Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7387342Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7387467Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7387815Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7387853Z graph_break [] 2025-12-04T09:31:25.7387927Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7387984Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7388081Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7388431Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7388469Z graph_break [] 2025-12-04T09:31:25.7388542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7388623Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7388721Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7389064Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7389102Z graph_break [] 2025-12-04T09:31:25.7389175Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7389232Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7389328Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7389675Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7389713Z graph_break [] 2025-12-04T09:31:25.7389789Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7389844Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7389941Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7390292Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7390329Z graph_break [] 2025-12-04T09:31:25.7390401Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7390460Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7390556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7390902Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7390939Z graph_break [] 2025-12-04T09:31:25.7391011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7391068Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7391186Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7391531Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7391569Z graph_break [] 2025-12-04T09:31:25.7391643Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7391697Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7391794Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7392137Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7392176Z graph_break [] 2025-12-04T09:31:25.7392248Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7392303Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7392422Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7392770Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7392807Z graph_break [] 2025-12-04T09:31:25.7392890Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7392937Z Traceback (most recent call last): 2025-12-04T09:31:25.7393068Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7393108Z self.common( 2025-12-04T09:31:25.7393196Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7393240Z return func(*args, **kwds) 2025-12-04T09:31:25.7393370Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7393408Z check_model( 2025-12-04T09:31:25.7393523Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7393565Z assert_equal_fn( 2025-12-04T09:31:25.7393705Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7393769Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7393930Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7394007Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7394060Z 
AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7394063Z 2025-12-04T09:31:25.7394109Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7394214Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7394318Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7394320Z 2025-12-04T09:31:25.7394365Z The failure occurred for item [2] 2025-12-04T09:31:25.7394367Z 2025-12-04T09:31:25.7394442Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7394589Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7394591Z 2025-12-04T09:31:25.7394680Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7394777Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7394834Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7394931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7395280Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7395316Z graph_break [] 2025-12-04T09:31:25.7395390Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7395447Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7395546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7395896Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7396027Z graph_break [] 2025-12-04T09:31:25.7396102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7396159Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7396256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7396606Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7396644Z graph_break [] 2025-12-04T09:31:25.7396718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7396775Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7396869Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7397214Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7397250Z graph_break [] 2025-12-04T09:31:25.7397324Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7397380Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7397478Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7397825Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7397865Z graph_break [] 2025-12-04T09:31:25.7397939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7397993Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7398092Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7398436Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7398475Z graph_break [] 2025-12-04T09:31:25.7398575Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7398633Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7398728Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7399074Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7399110Z graph_break [] 2025-12-04T09:31:25.7399183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7399238Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7399334Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7399679Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7399745Z graph_break [] 2025-12-04T09:31:25.7399819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7399876Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7399972Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7400320Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7400357Z graph_break [] 2025-12-04T09:31:25.7400431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7400486Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7400581Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7400933Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7400969Z graph_break [] 2025-12-04T09:31:25.7401043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7401098Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7401194Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:31:25.7401539Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7401577Z graph_break [] 2025-12-04T09:31:25.7401651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7401707Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7401801Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7402148Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7402184Z graph_break [] 2025-12-04T09:31:25.7402258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7402333Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7402431Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7402782Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7402819Z graph_break [] 2025-12-04T09:31:25.7402892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7402948Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7403044Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7403389Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7403428Z graph_break [] 2025-12-04T09:31:25.7403525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7403583Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7403678Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7404025Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7404063Z graph_break [] 2025-12-04T09:31:25.7404137Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7404192Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7404288Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7404634Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7404672Z graph_break [] 2025-12-04T09:31:25.7404745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7404803Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7404898Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7405247Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7405285Z graph_break [] 2025-12-04T09:31:25.7405359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7405416Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7405511Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7405861Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7405898Z graph_break [] 2025-12-04T09:31:25.7406030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7406115Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7406213Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7406558Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7406599Z graph_break [] 2025-12-04T09:31:25.7406671Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7406726Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7406821Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7407171Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7407207Z graph_break [] 2025-12-04T09:31:25.7407311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7407366Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7407462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7407808Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7407845Z graph_break [] 2025-12-04T09:31:25.7407920Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7407975Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7408072Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7408416Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7408455Z graph_break [] 2025-12-04T09:31:25.7408527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7408582Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7408677Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7409022Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7409058Z graph_break [] 2025-12-04T09:31:25.7409134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7409190Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7409287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7409634Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7409672Z graph_break [] 2025-12-04T09:31:25.7409745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7409820Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7409916Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7410256Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7410294Z graph_break [] 2025-12-04T09:31:25.7410367Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7410421Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7410516Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7410862Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7410900Z graph_break [] 2025-12-04T09:31:25.7410973Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7411060Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7411155Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7411505Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7411541Z graph_break [] 2025-12-04T09:31:25.7411612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7411670Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7411766Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7412111Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7412149Z graph_break [] 2025-12-04T09:31:25.7412221Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7412274Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7412370Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7412716Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7412753Z graph_break [] 2025-12-04T09:31:25.7412826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7412883Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7412978Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7413327Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7413364Z graph_break [] 2025-12-04T09:31:25.7413436Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7413514Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7413610Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7413956Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7413996Z 
graph_break []
2025-12-04T09:31:25.7414069Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7414122Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7414219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7414564Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7414601Z graph_break []
2025-12-04T09:31:25.7414683Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________
2025-12-04T09:31:25.7414754Z Traceback (most recent call last):
2025-12-04T09:31:25.7414881Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:31:25.7414920Z self.common(
2025-12-04T09:31:25.7415009Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:31:25.7415052Z return func(*args, **kwds)
2025-12-04T09:31:25.7415178Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu
2025-12-04T09:31:25.7415216Z check_model(
2025-12-04T09:31:25.7415333Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:31:25.7415374Z assert_equal_fn(
2025-12-04T09:31:25.7415514Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:31:25.7415575Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:31:25.7415736Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:31:25.7415809Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:31:25.7415861Z AssertionError: Tensor-likes are not close!
2025-12-04T09:31:25.7415863Z
2025-12-04T09:31:25.7415909Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:31:25.7416067Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:31:25.7416170Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed)
2025-12-04T09:31:25.7416172Z
2025-12-04T09:31:25.7416216Z The failure occurred for item [2]
2025-12-04T09:31:25.7416218Z
2025-12-04T09:31:25.7416294Z To execute this test, run the following from the base repo dir:
2025-12-04T09:31:25.7416441Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda
2025-12-04T09:31:25.7416445Z
2025-12-04T09:31:25.7416532Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:31:25.7416606Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7416661Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7416759Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7417106Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7417173Z graph_break []
2025-12-04T09:31:25.7417245Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7417304Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:31:25.7417402Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:31:25.7417752Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7417788Z graph_break [] 2025-12-04T09:31:25.7417862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7417918Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7418014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7418360Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7418428Z graph_break [] 2025-12-04T09:31:25.7418500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7418557Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7418652Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7418998Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7419036Z graph_break [] 2025-12-04T09:31:25.7419108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7419164Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7419261Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7419609Z 
inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7419646Z graph_break [] 2025-12-04T09:31:25.7419720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7419774Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7419871Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7420218Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7420257Z graph_break [] 2025-12-04T09:31:25.7420329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7420383Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7420478Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7420822Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7420859Z graph_break [] 2025-12-04T09:31:25.7420955Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7421009Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7421106Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:31:25.7421454Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7421491Z graph_break [] 2025-12-04T09:31:25.7421564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7421621Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7421719Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7422068Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7422130Z graph_break [] 2025-12-04T09:31:25.7422203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7422258Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7422354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7422699Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7422734Z graph_break [] 2025-12-04T09:31:25.7422809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7422862Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7422959Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7423304Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7423341Z graph_break [] 2025-12-04T09:31:25.7423413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7423468Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7423566Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7423915Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7423956Z graph_break [] 2025-12-04T09:31:25.7424028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7424082Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7424176Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7424521Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7424557Z graph_break [] 2025-12-04T09:31:25.7424658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7424715Z stats [('calls_captured', 12), ('unique_graphs', 2)] 
2025-12-04T09:31:25.7424811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7425156Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7425194Z graph_break [] 2025-12-04T09:31:25.7425265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7425322Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7425417Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7425764Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7425825Z graph_break [] 2025-12-04T09:31:25.7425898Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7425991Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7426087Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7426436Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7426472Z graph_break [] 2025-12-04T09:31:25.7426547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7426603Z stats 
[('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7426698Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7427044Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7427082Z graph_break [] 2025-12-04T09:31:25.7427155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7427212Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7427308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7427658Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7427695Z graph_break [] 2025-12-04T09:31:25.7427771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7427829Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7427926Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7428273Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7428312Z graph_break [] 2025-12-04T09:31:25.7428411Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7428469Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7428565Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7428909Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7428946Z graph_break [] 2025-12-04T09:31:25.7429018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7429076Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7429172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7429520Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7429584Z graph_break [] 2025-12-04T09:31:25.7429658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7429714Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7429810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7430158Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7430198Z graph_break [] 2025-12-04T09:31:25.7430271Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7430331Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7430426Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7430774Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7430813Z graph_break [] 2025-12-04T09:31:25.7430885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7430944Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7431040Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7431387Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7431426Z graph_break [] 2025-12-04T09:31:25.7431502Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7431557Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7431655Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7432001Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7432039Z 
graph_break [] 2025-12-04T09:31:25.7432134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7432190Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7432286Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7432633Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7432670Z graph_break [] 2025-12-04T09:31:25.7432744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7432800Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7432897Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7433246Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7433309Z graph_break [] 2025-12-04T09:31:25.7433383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7433440Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7433537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7433880Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7433918Z graph_break [] 2025-12-04T09:31:25.7433993Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7434050Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7434146Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7434498Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7434535Z graph_break [] 2025-12-04T09:31:25.7434609Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7434662Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7434759Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7435105Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7435146Z graph_break [] 2025-12-04T09:31:25.7435220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7435276Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7435373Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7435717Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7435755Z graph_break [] 2025-12-04T09:31:25.7435850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7435907Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7436060Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7436411Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7436449Z graph_break [] 2025-12-04T09:31:25.7436523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7436580Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7436679Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7437025Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7437091Z graph_break [] 2025-12-04T09:31:25.7437165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7437222Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7437317Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7437663Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7437704Z graph_break []
2025-12-04T09:31:25.7437786Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________
2025-12-04T09:31:25.7437834Z Traceback (most recent call last):
2025-12-04T09:31:25.7437960Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:31:25.7438000Z self.common(
2025-12-04T09:31:25.7438088Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:31:25.7438131Z return func(*args, **kwds)
2025-12-04T09:31:25.7438256Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu
2025-12-04T09:31:25.7438296Z check_model(
2025-12-04T09:31:25.7438411Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:31:25.7438450Z assert_equal_fn(
2025-12-04T09:31:25.7438589Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:31:25.7438649Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:31:25.7438810Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:31:25.7438884Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:31:25.7438937Z AssertionError: Tensor-likes are not close!
2025-12-04T09:31:25.7438940Z
2025-12-04T09:31:25.7438986Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:31:25.7439083Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:31:25.7439178Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed)
2025-12-04T09:31:25.7439180Z
2025-12-04T09:31:25.7439226Z The failure occurred for item [2]
2025-12-04T09:31:25.7439228Z
2025-12-04T09:31:25.7439304Z To execute this test, run the following from the base repo dir:
2025-12-04T09:31:25.7439450Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda
2025-12-04T09:31:25.7439452Z
2025-12-04T09:31:25.7439577Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:31:25.7439651Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7439712Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7439810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7440157Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7440194Z graph_break []
2025-12-04T09:31:25.7440269Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7440328Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:31:25.7440426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:31:25.7440772Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10),
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7440832Z graph_break [] 2025-12-04T09:31:25.7440906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7440964Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7441063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7441412Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7441450Z graph_break [] 2025-12-04T09:31:25.7441522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7441582Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7441679Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7442024Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7442060Z graph_break [] 2025-12-04T09:31:25.7442134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7442191Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7442289Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7442634Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7442675Z graph_break [] 2025-12-04T09:31:25.7442750Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7442807Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7442904Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7443274Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7443312Z graph_break [] 2025-12-04T09:31:25.7443385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7443441Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7443537Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7443882Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7443918Z graph_break [] 2025-12-04T09:31:25.7443994Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7444049Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7444147Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7444492Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7444556Z graph_break [] 2025-12-04T09:31:25.7444630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7444688Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7444784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7445137Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7445175Z graph_break [] 2025-12-04T09:31:25.7445250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7445306Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7445404Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7445754Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7445791Z graph_break [] 2025-12-04T09:31:25.7445865Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7445959Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7446061Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7446405Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7446446Z graph_break [] 2025-12-04T09:31:25.7446519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7446575Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7446670Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7447050Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7447087Z graph_break [] 2025-12-04T09:31:25.7447163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7447218Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7447318Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7447658Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7447696Z graph_break [] 2025-12-04T09:31:25.7447770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7447827Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7447926Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7448271Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7448338Z graph_break [] 2025-12-04T09:31:25.7448410Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7448468Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7448564Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7448910Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7448948Z graph_break [] 2025-12-04T09:31:25.7449024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7449078Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7449177Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7449524Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7449563Z graph_break [] 2025-12-04T09:31:25.7449637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7449695Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7449791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7450135Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7450174Z graph_break [] 2025-12-04T09:31:25.7450246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7450303Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7450398Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7450744Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7450804Z graph_break [] 2025-12-04T09:31:25.7450879Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7450935Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7451033Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7451383Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7451420Z graph_break [] 2025-12-04T09:31:25.7451493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7451550Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7451647Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7451992Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7452056Z graph_break [] 2025-12-04T09:31:25.7452132Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7452188Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7452284Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7452630Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7452668Z graph_break [] 2025-12-04T09:31:25.7452742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7452796Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7452894Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7453235Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7453272Z graph_break [] 2025-12-04T09:31:25.7453345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7453401Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7453496Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7453844Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7453881Z graph_break [] 2025-12-04T09:31:25.7453954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7454010Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7454105Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7454449Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7454509Z graph_break [] 2025-12-04T09:31:25.7454581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7454635Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7454732Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7455077Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7455114Z graph_break [] 2025-12-04T09:31:25.7455187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7455241Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7455336Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7455687Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7455747Z graph_break [] 2025-12-04T09:31:25.7455821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7455876Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7456031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7456378Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7456416Z graph_break [] 2025-12-04T09:31:25.7456490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7456546Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7456643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7456988Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7457026Z graph_break [] 2025-12-04T09:31:25.7457098Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7457153Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7457249Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7457599Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7457636Z graph_break [] 2025-12-04T09:31:25.7457709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7457762Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7457858Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7458201Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7458239Z graph_break [] 2025-12-04T09:31:25.7458338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7458393Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7458488Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7458836Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7458872Z graph_break [] 2025-12-04T09:31:25.7458947Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7459000Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7459098Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7459441Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7459510Z graph_break [] 2025-12-04T09:31:25.7459583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7459640Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7459738Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7460084Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7460121Z graph_break [] 2025-12-04T09:31:25.7460195Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7460251Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7460345Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7460692Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7460728Z 
graph_break [] 2025-12-04T09:31:25.7460803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7460859Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7460954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7461300Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7461338Z graph_break [] 2025-12-04T09:31:25.7461418Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7461467Z Traceback (most recent call last): 2025-12-04T09:31:25.7461593Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7461631Z self.common( 2025-12-04T09:31:25.7461719Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7461762Z return func(*args, **kwds) 2025-12-04T09:31:25.7461888Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7461925Z check_model( 2025-12-04T09:31:25.7462070Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7462109Z assert_equal_fn( 2025-12-04T09:31:25.7462248Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7462310Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7462470Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7462544Z raise error_metas.pop()[0].to_error( # type: 
ignore[index] 2025-12-04T09:31:25.7462597Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7462599Z 2025-12-04T09:31:25.7462643Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7462746Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7462850Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7462852Z 2025-12-04T09:31:25.7462897Z The failure occurred for item [2] 2025-12-04T09:31:25.7462899Z 2025-12-04T09:31:25.7462973Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7463144Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7463147Z 2025-12-04T09:31:25.7463234Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7463308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7463362Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7463460Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7463808Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7463845Z graph_break [] 2025-12-04T09:31:25.7463919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7463979Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7464075Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7464424Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7464461Z graph_break [] 2025-12-04T09:31:25.7464533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7464593Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7464688Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7465034Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7465073Z graph_break [] 2025-12-04T09:31:25.7465147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7465202Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7465297Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7465664Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7465703Z graph_break [] 2025-12-04T09:31:25.7465775Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7465832Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7465974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7466321Z 
inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7466357Z graph_break [] 2025-12-04T09:31:25.7466432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7466488Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7466585Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7466931Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7467011Z graph_break [] 2025-12-04T09:31:25.7467085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7467139Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7467237Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7467581Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7467618Z graph_break [] 2025-12-04T09:31:25.7467689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7467746Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7467841Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:31:25.7468187Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7468223Z graph_break [] 2025-12-04T09:31:25.7468296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7468352Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7468448Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7468794Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7468832Z graph_break [] 2025-12-04T09:31:25.7468904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7468959Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7469053Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7469427Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7469464Z graph_break [] 2025-12-04T09:31:25.7469536Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7469592Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7469686Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7470026Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7470062Z graph_break [] 2025-12-04T09:31:25.7470136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7470189Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7470287Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7470630Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7470693Z graph_break [] 2025-12-04T09:31:25.7470765Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7470819Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7470913Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7471259Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7471294Z graph_break [] 2025-12-04T09:31:25.7471367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7471426Z stats [('calls_captured', 12), ('unique_graphs', 2)] 
2025-12-04T09:31:25.7471522Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7471868Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7471903Z graph_break [] 2025-12-04T09:31:25.7471976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7472032Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7472130Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7472478Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7472517Z graph_break [] 2025-12-04T09:31:25.7472589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7472643Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7472738Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7473108Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7473145Z graph_break [] 2025-12-04T09:31:25.7473217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7473274Z stats 
[('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7473371Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7473716Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7473752Z graph_break [] 2025-12-04T09:31:25.7473825Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7473881Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7473979Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7474327Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7474391Z graph_break [] 2025-12-04T09:31:25.7474464Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7474522Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7474617Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7474968Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7475004Z graph_break [] 2025-12-04T09:31:25.7475077Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7475132Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7475228Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7475572Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7475609Z graph_break [] 2025-12-04T09:31:25.7475682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7475738Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7475836Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7476220Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7476259Z graph_break [] 2025-12-04T09:31:25.7476330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7476385Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7476480Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7476856Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7476892Z graph_break [] 2025-12-04T09:31:25.7476964Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7477019Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7477114Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7477458Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7477495Z graph_break [] 2025-12-04T09:31:25.7477567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7477623Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7477719Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7478066Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7478128Z graph_break [] 2025-12-04T09:31:25.7478201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7478255Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7478351Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7478701Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7478738Z 
graph_break [] 2025-12-04T09:31:25.7478810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7478866Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7478961Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7479302Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7479339Z graph_break [] 2025-12-04T09:31:25.7479411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7479467Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7479563Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7479908Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7479946Z graph_break [] 2025-12-04T09:31:25.7480018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7480073Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7480169Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7480566Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7480605Z graph_break [] 2025-12-04T09:31:25.7480676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7480734Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7480829Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7481177Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7481215Z graph_break [] 2025-12-04T09:31:25.7481287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7481342Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7481439Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7481783Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7481844Z graph_break [] 2025-12-04T09:31:25.7481916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7481970Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7482066Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7482410Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7482447Z graph_break [] 2025-12-04T09:31:25.7482519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7482574Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7482671Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7483022Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7483059Z graph_break [] 2025-12-04T09:31:25.7483133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7483189Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7483286Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7483634Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7483671Z graph_break [] 2025-12-04T09:31:25.7483745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7483799Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7483895Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7484260Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7484298Z graph_break [] 2025-12-04T09:31:25.7484370Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7484427Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7484524Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7484872Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7484908Z graph_break [] 2025-12-04T09:31:25.7484980Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7485034Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7485132Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7485472Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7485530Z graph_break [] 2025-12-04T09:31:25.7485611Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7485657Z Traceback (most recent call last): 2025-12-04T09:31:25.7485783Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7485821Z self.common( 2025-12-04T09:31:25.7485909Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 
2025-12-04T09:31:25.7486010Z return func(*args, **kwds) 2025-12-04T09:31:25.7486137Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.7486175Z check_model( 2025-12-04T09:31:25.7486290Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7486331Z assert_equal_fn( 2025-12-04T09:31:25.7486469Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7486530Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7486690Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7486761Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7486815Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7486817Z 2025-12-04T09:31:25.7486861Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7486959Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7487053Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7487055Z 2025-12-04T09:31:25.7487101Z The failure occurred for item [2] 2025-12-04T09:31:25.7487103Z 2025-12-04T09:31:25.7487177Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7487324Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7487326Z 2025-12-04T09:31:25.7487413Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7487487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7487542Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7487641Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7488011Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7488052Z graph_break [] 2025-12-04T09:31:25.7488124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7488181Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7488277Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7488624Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7488661Z graph_break [] 2025-12-04T09:31:25.7488734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7488791Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7488885Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7489266Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7489302Z graph_break [] 2025-12-04T09:31:25.7489376Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7489431Z stats [('calls_captured', 12), ('unique_graphs', 2)] 
2025-12-04T09:31:25.7489528Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7489874Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7489912Z graph_break [] 2025-12-04T09:31:25.7489985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7490043Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7490137Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7490484Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7490520Z graph_break [] 2025-12-04T09:31:25.7490593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7490649Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7490746Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7491094Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7491130Z graph_break [] 2025-12-04T09:31:25.7491204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7491258Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7491355Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7491724Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7491762Z graph_break [] 2025-12-04T09:31:25.7491837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7491892Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7491987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7492334Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7492369Z graph_break [] 2025-12-04T09:31:25.7492443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7492503Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7492600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7492965Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7493002Z graph_break [] 2025-12-04T09:31:25.7493075Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7493130Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7493225Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7493574Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7493612Z graph_break [] 2025-12-04T09:31:25.7493686Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7493740Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7493836Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7494180Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7494216Z graph_break [] 2025-12-04T09:31:25.7494289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7494345Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7494441Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7494783Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7494823Z graph_break [] 2025-12-04T09:31:25.7494895Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7494950Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7495044Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7495417Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7495454Z graph_break [] 2025-12-04T09:31:25.7495529Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7495587Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7495681Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7496086Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7496122Z graph_break [] 2025-12-04T09:31:25.7496194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7496252Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7496348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7496692Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7496762Z 
graph_break [] 2025-12-04T09:31:25.7496834Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7496889Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7496984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7497329Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7497366Z graph_break [] 2025-12-04T09:31:25.7497440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7497497Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7497594Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7497940Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7497977Z graph_break [] 2025-12-04T09:31:25.7498051Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7498108Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7498204Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7498547Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7498585Z graph_break [] 2025-12-04T09:31:25.7498658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7498715Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7498809Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7499184Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7499220Z graph_break [] 2025-12-04T09:31:25.7499293Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7499350Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7507451Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7508020Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7508072Z graph_break [] 2025-12-04T09:31:25.7508167Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7508275Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7508386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7508742Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), 
('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7509121Z graph_break [] 2025-12-04T09:31:25.7509201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7509264Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7509367Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7509719Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7509759Z graph_break [] 2025-12-04T09:31:25.7509838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7509897Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7509999Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7510344Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7510386Z graph_break [] 2025-12-04T09:31:25.7510461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7510527Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7510625Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7510972Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7511013Z graph_break [] 2025-12-04T09:31:25.7511090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7511146Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7511248Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7511705Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7511745Z graph_break [] 2025-12-04T09:31:25.7511824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7511882Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7511984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7512329Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7512370Z graph_break [] 2025-12-04T09:31:25.7512445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7512510Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7512607Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7512959Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7513097Z graph_break [] 2025-12-04T09:31:25.7513174Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7513233Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7513334Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7513682Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7513726Z graph_break [] 2025-12-04T09:31:25.7513800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7513860Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7513957Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7514302Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7514343Z graph_break [] 2025-12-04T09:31:25.7514416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7514476Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7514574Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7514923Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7514963Z graph_break [] 2025-12-04T09:31:25.7515040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7515096Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7515196Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7515558Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7515600Z graph_break [] 2025-12-04T09:31:25.7515677Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7515738Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7515835Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7516266Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7516305Z graph_break [] 2025-12-04T09:31:25.7516382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7516440Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7516541Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7516887Z inductor 
[('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7516966Z graph_break [] 2025-12-04T09:31:25.7517044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7517100Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7517201Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7517547Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7517590Z graph_break [] 2025-12-04T09:31:25.7517663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7517728Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7517826Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7518174Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7518212Z graph_break [] 2025-12-04T09:31:25.7518289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7518344Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7518445Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 
1)] 2025-12-04T09:31:25.7518790Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7518833Z graph_break [] 2025-12-04T09:31:25.7518910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7518969Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7519070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7519452Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7519494Z graph_break [] 2025-12-04T09:31:25.7519578Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7519634Z Traceback (most recent call last): 2025-12-04T09:31:25.7519777Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7519821Z self.common( 2025-12-04T09:31:25.7519921Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7519975Z return func(*args, **kwds) 2025-12-04T09:31:25.7520105Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.7520147Z check_model( 2025-12-04T09:31:25.7520264Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7520309Z assert_equal_fn( 2025-12-04T09:31:25.7520457Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7520524Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7520739Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7520819Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7520874Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7520879Z 2025-12-04T09:31:25.7520931Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7521032Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7521129Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7521132Z 2025-12-04T09:31:25.7521179Z The failure occurred for item [2] 2025-12-04T09:31:25.7521181Z 2025-12-04T09:31:25.7521262Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7521413Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7521416Z 2025-12-04T09:31:25.7521514Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7521592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7521653Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7521755Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7522107Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7522149Z graph_break [] 
2025-12-04T09:31:25.7522226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7522289Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7522391Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7522743Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7522781Z graph_break [] 2025-12-04T09:31:25.7522858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7522916Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7523017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7523381Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7523425Z graph_break [] 2025-12-04T09:31:25.7523500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7523561Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7523658Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7524008Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 
2025-12-04T09:31:25.7524046Z graph_break [] 2025-12-04T09:31:25.7524124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7524185Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7524282Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7524652Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7524691Z graph_break [] 2025-12-04T09:31:25.7524768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7524824Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7524925Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7525270Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7525314Z graph_break [] 2025-12-04T09:31:25.7525387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7525446Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7525543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7525894Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7525966Z graph_break [] 2025-12-04T09:31:25.7526045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7526102Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7526202Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7526547Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7526588Z graph_break [] 2025-12-04T09:31:25.7526667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7526725Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7526825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7527203Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7527246Z graph_break [] 2025-12-04T09:31:25.7527320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7527379Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7527476Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7527827Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7527865Z graph_break [] 2025-12-04T09:31:25.7527943Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7528000Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7528100Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7528469Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7528511Z graph_break [] 2025-12-04T09:31:25.7528585Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7528647Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7528744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7529091Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7529135Z graph_break [] 2025-12-04T09:31:25.7529209Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7529269Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7529366Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7529711Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7529749Z graph_break [] 2025-12-04T09:31:25.7529827Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7529885Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7529986Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7530334Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7530374Z graph_break [] 2025-12-04T09:31:25.7530448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7530509Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7530610Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7530979Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7531019Z graph_break [] 2025-12-04T09:31:25.7531095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7531151Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7531250Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7531595Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7531633Z graph_break [] 2025-12-04T09:31:25.7531713Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7531771Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7531870Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7532244Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7532285Z graph_break [] 2025-12-04T09:31:25.7532359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7532420Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7532517Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7532868Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7532907Z graph_break [] 2025-12-04T09:31:25.7532984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7533041Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7533141Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7533485Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7533527Z graph_break [] 2025-12-04T09:31:25.7533603Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7533662Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7533759Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7534106Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7534146Z graph_break [] 2025-12-04T09:31:25.7534220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7534280Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7534376Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7534746Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7534789Z graph_break [] 2025-12-04T09:31:25.7534866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7534921Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7535021Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7535361Z 
inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7535402Z graph_break [] 2025-12-04T09:31:25.7535476Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7535535Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7535632Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7536066Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7536104Z graph_break [] 2025-12-04T09:31:25.7536180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7536238Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7536340Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7536693Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7536732Z graph_break [] 2025-12-04T09:31:25.7536810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7536866Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7536968Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T09:31:25.7537311Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7537352Z graph_break [] 2025-12-04T09:31:25.7537426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7537485Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7537583Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7537934Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7537972Z graph_break [] 2025-12-04T09:31:25.7538048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7538106Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7538206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7538590Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7538633Z graph_break [] 2025-12-04T09:31:25.7538712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7538770Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7538870Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7539214Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7539255Z graph_break [] 2025-12-04T09:31:25.7539328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7539389Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7539486Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7539893Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7539933Z graph_break [] 2025-12-04T09:31:25.7540015Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7540073Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7540179Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7540525Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7540571Z graph_break [] 2025-12-04T09:31:25.7540651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7540713Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7546829Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7547211Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7547263Z graph_break [] 2025-12-04T09:31:25.7547345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7547421Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7547524Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7547883Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7547938Z graph_break [] 2025-12-04T09:31:25.7548016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7548087Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7548189Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7548627Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7548672Z graph_break [] 2025-12-04T09:31:25.7548759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7548820Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7548931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7549279Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7549331Z graph_break [] 2025-12-04T09:31:25.7549407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7549473Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7549572Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7549924Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7550004Z graph_break [] 2025-12-04T09:31:25.7550087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7550144Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7550247Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7550600Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7550640Z graph_break [] 2025-12-04T09:31:25.7550722Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7550783Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7550885Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7551233Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7551279Z graph_break [] 2025-12-04T09:31:25.7551356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7551420Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7551520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7551871Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7551913Z graph_break [] 2025-12-04T09:31:25.7551995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7552053Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7552157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7552525Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7552570Z graph_break [] 2025-12-04T09:31:25.7552655Z 
_______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________
2025-12-04T09:31:25.7552715Z Traceback (most recent call last):
2025-12-04T09:31:25.7552853Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:31:25.7552900Z self.common(
2025-12-04T09:31:25.7552998Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:31:25.7553054Z return func(*args, **kwds)
2025-12-04T09:31:25.7553186Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu
2025-12-04T09:31:25.7553231Z check_model(
2025-12-04T09:31:25.7553356Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:31:25.7553399Z assert_equal_fn(
2025-12-04T09:31:25.7553551Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:31:25.7553616Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:31:25.7553811Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:31:25.7553888Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:31:25.7553950Z AssertionError: Tensor-likes are not close!
2025-12-04T09:31:25.7553954Z
2025-12-04T09:31:25.7554003Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:31:25.7554121Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:31:25.7554228Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed)
2025-12-04T09:31:25.7554230Z
2025-12-04T09:31:25.7554283Z The failure occurred for item [2]
2025-12-04T09:31:25.7554288Z
2025-12-04T09:31:25.7554368Z To execute this test, run the following from the base repo dir:
2025-12-04T09:31:25.7554529Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda
2025-12-04T09:31:25.7554534Z
2025-12-04T09:31:25.7554626Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:31:25.7554710Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7554772Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7554878Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7555228Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7555275Z graph_break []
2025-12-04T09:31:25.7555351Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7555417Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:31:25.7555518Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:31:25.7555876Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10),
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7556028Z graph_break [] 2025-12-04T09:31:25.7556106Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7556170Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7556313Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7556662Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7556704Z graph_break [] 2025-12-04T09:31:25.7556785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7556844Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7556948Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7557292Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7557337Z graph_break [] 2025-12-04T09:31:25.7557412Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7557477Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7557604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7557953Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7557992Z graph_break [] 2025-12-04T09:31:25.7558074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7558133Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7558237Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7558591Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7558633Z graph_break [] 2025-12-04T09:31:25.7558716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7558775Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7558878Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7559225Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7559270Z graph_break [] 2025-12-04T09:31:25.7559346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7559409Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7559508Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7559858Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7559897Z graph_break [] 2025-12-04T09:31:25.7559976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7560035Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7560138Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7560510Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7560556Z graph_break [] 2025-12-04T09:31:25.7560631Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7560693Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7560797Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7561141Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7561185Z graph_break [] 2025-12-04T09:31:25.7561257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7561320Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7561438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7561784Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7561822Z graph_break [] 2025-12-04T09:31:25.7561900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7561956Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7562057Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7562406Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7562450Z graph_break [] 2025-12-04T09:31:25.7562524Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7562584Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7562681Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7563029Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7563067Z graph_break [] 2025-12-04T09:31:25.7563147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7563206Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7563307Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7563656Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7563694Z graph_break [] 2025-12-04T09:31:25.7563773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7563831Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7563932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7564301Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7564345Z graph_break [] 2025-12-04T09:31:25.7564420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7564479Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7564578Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7564926Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7564964Z graph_break [] 2025-12-04T09:31:25.7565044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7565103Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7565203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7565568Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7565611Z graph_break [] 2025-12-04T09:31:25.7565685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7565747Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7565844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7566227Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7566270Z graph_break [] 2025-12-04T09:31:25.7566344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7566405Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7566502Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7566857Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7566895Z graph_break [] 2025-12-04T09:31:25.7566975Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7567032Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7567134Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7567481Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7567523Z graph_break [] 2025-12-04T09:31:25.7567598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7567660Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7567756Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7568139Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7568180Z graph_break [] 2025-12-04T09:31:25.7568258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7568314Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7568415Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7568761Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7568799Z graph_break [] 2025-12-04T09:31:25.7568878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7568934Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7569035Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7569406Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7569449Z graph_break [] 2025-12-04T09:31:25.7569523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7569586Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7569683Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7570034Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7570075Z graph_break [] 2025-12-04T09:31:25.7570152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7570208Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7570309Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7570653Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7570694Z graph_break [] 2025-12-04T09:31:25.7570770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7570830Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7570932Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7571280Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7571322Z graph_break [] 2025-12-04T09:31:25.7571396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7571458Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7571555Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7571934Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7571975Z graph_break [] 2025-12-04T09:31:25.7572052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7572110Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7572212Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7572555Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7572597Z graph_break [] 2025-12-04T09:31:25.7572673Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7572733Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7572830Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7573208Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7573250Z graph_break [] 2025-12-04T09:31:25.7573324Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7573384Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7573481Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7573831Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7573871Z graph_break [] 2025-12-04T09:31:25.7573949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7574006Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7574107Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7574448Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7574490Z graph_break [] 2025-12-04T09:31:25.7574566Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7574626Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7574723Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7575071Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7575110Z graph_break [] 2025-12-04T09:31:25.7575189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7575247Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7575348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7575718Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7575763Z graph_break [] 2025-12-04T09:31:25.7575841Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7575898Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7576028Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7576373Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7576416Z 
graph_break [] 2025-12-04T09:31:25.7576492Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7576554Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7576651Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7577037Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7577075Z graph_break [] 2025-12-04T09:31:25.7577153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7577210Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7577311Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7577658Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7577702Z graph_break [] 2025-12-04T09:31:25.7577776Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7577838Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7577935Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7578283Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7578325Z graph_break [] 2025-12-04T09:31:25.7578401Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7578463Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7578561Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7578911Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7578949Z graph_break [] 2025-12-04T09:31:25.7579027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7579086Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7579187Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7579565Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7579609Z graph_break [] 2025-12-04T09:31:25.7579684Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7579745Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7579844Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7580189Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7580228Z graph_break [] 2025-12-04T09:31:25.7580316Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7580365Z Traceback (most recent call last): 2025-12-04T09:31:25.7580498Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7580561Z self.common( 2025-12-04T09:31:25.7580657Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7580702Z return func(*args, **kwds) 2025-12-04T09:31:25.7580835Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7580874Z check_model( 2025-12-04T09:31:25.7580996Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7581036Z assert_equal_fn( 2025-12-04T09:31:25.7581182Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7581249Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7581412Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7581491Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7581549Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7581551Z 2025-12-04T09:31:25.7581602Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7581708Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7581819Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7581821Z 2025-12-04T09:31:25.7581868Z The failure occurred for item [2] 2025-12-04T09:31:25.7581870Z 2025-12-04T09:31:25.7581950Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7582101Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7582105Z 2025-12-04T09:31:25.7582199Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7582275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7582338Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7582438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7582784Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7582823Z graph_break [] 2025-12-04T09:31:25.7582901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7582961Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7583085Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7583436Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7583477Z graph_break [] 2025-12-04T09:31:25.7583559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7583621Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7583724Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7584079Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7584122Z graph_break [] 2025-12-04T09:31:25.7584196Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7584288Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7584386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7584734Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7584772Z graph_break [] 2025-12-04T09:31:25.7584851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7584909Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7585012Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7585359Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7585403Z graph_break [] 2025-12-04T09:31:25.7585479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7585540Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7585638Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7586030Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7586074Z graph_break [] 2025-12-04T09:31:25.7586149Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7586211Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7586308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7586657Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7586695Z graph_break [] 2025-12-04T09:31:25.7586774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7586830Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7586957Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7587305Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7587349Z graph_break [] 2025-12-04T09:31:25.7587424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7587486Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7587583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7587933Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7587972Z graph_break [] 2025-12-04T09:31:25.7588050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7588134Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7588236Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7588585Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7588627Z graph_break [] 2025-12-04T09:31:25.7588701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7588761Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7588860Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7589207Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7589248Z graph_break [] 2025-12-04T09:31:25.7589328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7589385Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7589486Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7589830Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7589873Z graph_break [] 2025-12-04T09:31:25.7589951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7590007Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7590109Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7590458Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7590501Z graph_break [] 2025-12-04T09:31:25.7590576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7590639Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7590757Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7591106Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7591147Z graph_break [] 2025-12-04T09:31:25.7591226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7591285Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7591385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7591735Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7591778Z graph_break [] 2025-12-04T09:31:25.7591853Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7591913Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7592034Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7592381Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7592424Z graph_break [] 2025-12-04T09:31:25.7592499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7592562Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7592662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7593012Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7593052Z graph_break [] 2025-12-04T09:31:25.7593131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7593189Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7593291Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7593639Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7593682Z graph_break [] 2025-12-04T09:31:25.7593757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7593819Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7593918Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7594268Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7594307Z graph_break [] 2025-12-04T09:31:25.7594385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7594442Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7594564Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7594916Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7594957Z graph_break [] 2025-12-04T09:31:25.7595036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7595095Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7595196Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7595540Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7595584Z graph_break [] 2025-12-04T09:31:25.7595658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7595719Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7595837Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7596236Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7596275Z graph_break [] 2025-12-04T09:31:25.7596354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7596411Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7596514Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7596861Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7596905Z graph_break [] 2025-12-04T09:31:25.7596981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7597044Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7597145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7597491Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7597533Z graph_break [] 2025-12-04T09:31:25.7597607Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7597663Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7597760Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7598103Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7598139Z graph_break [] 2025-12-04T09:31:25.7598214Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7598268Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7598365Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7598737Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7598778Z graph_break [] 2025-12-04T09:31:25.7598851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7598910Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7599006Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7599353Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7599392Z graph_break [] 2025-12-04T09:31:25.7599465Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7599522Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7599652Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7599997Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7600034Z graph_break [] 2025-12-04T09:31:25.7600108Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7600162Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7600260Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7600601Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7600641Z graph_break [] 2025-12-04T09:31:25.7600714Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7600770Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7600865Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7601218Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7601256Z graph_break [] 2025-12-04T09:31:25.7601331Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7601385Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7601484Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7601824Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7601862Z graph_break [] 2025-12-04T09:31:25.7601935Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7601990Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7602087Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7602447Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7602487Z graph_break [] 2025-12-04T09:31:25.7602561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7602619Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7602714Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7603058Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7603094Z graph_break [] 2025-12-04T09:31:25.7603172Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7603226Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7603346Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7603689Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7603726Z 
graph_break [] 2025-12-04T09:31:25.7603799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7603859Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7603954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7604299Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7604338Z graph_break [] 2025-12-04T09:31:25.7604411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7604467Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7604563Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7604907Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7604943Z graph_break [] 2025-12-04T09:31:25.7605019Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7605076Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7605174Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7605524Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7605562Z graph_break [] 2025-12-04T09:31:25.7605634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7605692Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7605788Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7606201Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7606239Z graph_break [] 2025-12-04T09:31:25.7606314Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7606370Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7606467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7606810Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7606846Z graph_break [] 2025-12-04T09:31:25.7606921Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7606976Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7607074Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7607445Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7607483Z graph_break []
2025-12-04T09:31:25.7607557Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7607613Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7607708Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7608053Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7608091Z graph_break []
2025-12-04T09:31:25.7608173Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________
2025-12-04T09:31:25.7608219Z Traceback (most recent call last):
2025-12-04T09:31:25.7608346Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:31:25.7608384Z self.common(
2025-12-04T09:31:25.7608475Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:31:25.7608519Z return func(*args, **kwds)
2025-12-04T09:31:25.7608649Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu
2025-12-04T09:31:25.7608685Z check_model(
2025-12-04T09:31:25.7608804Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:31:25.7608843Z assert_equal_fn(
2025-12-04T09:31:25.7608983Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:31:25.7609046Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:31:25.7609206Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:31:25.7609279Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:31:25.7609335Z AssertionError: Tensor-likes are not close!
2025-12-04T09:31:25.7609337Z
2025-12-04T09:31:25.7609382Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:31:25.7609480Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:31:25.7609599Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed)
2025-12-04T09:31:25.7609603Z
2025-12-04T09:31:25.7609649Z The failure occurred for item [2]
2025-12-04T09:31:25.7609651Z
2025-12-04T09:31:25.7609726Z To execute this test, run the following from the base repo dir:
2025-12-04T09:31:25.7609875Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda
2025-12-04T09:31:25.7609877Z
2025-12-04T09:31:25.7609968Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:31:25.7610041Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7610098Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7610196Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7610542Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7610578Z graph_break []
2025-12-04T09:31:25.7610652Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7610734Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:31:25.7610832Z
aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7611173Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7611212Z graph_break [] 2025-12-04T09:31:25.7611284Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7611343Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7611439Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7611785Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7611824Z graph_break [] 2025-12-04T09:31:25.7611896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7611954Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7612049Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7612394Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7612431Z graph_break [] 2025-12-04T09:31:25.7612505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7612563Z stats [('calls_captured', 12), 
('unique_graphs', 2)] 2025-12-04T09:31:25.7612659Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7613006Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7613044Z graph_break [] 2025-12-04T09:31:25.7613117Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7613174Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7613295Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7613645Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7613683Z graph_break [] 2025-12-04T09:31:25.7613758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7613812Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7613908Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7614250Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7614287Z graph_break [] 2025-12-04T09:31:25.7614361Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:31:25.7614437Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7614532Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7614874Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7614912Z graph_break [] 2025-12-04T09:31:25.7614985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7615042Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7615139Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7615485Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7615524Z graph_break [] 2025-12-04T09:31:25.7615599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7615653Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7615751Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7616172Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7616210Z graph_break [] 2025-12-04T09:31:25.7616283Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T09:31:25.7616340Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7616435Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7616778Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7616816Z graph_break [] 2025-12-04T09:31:25.7616888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7616944Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7617071Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7617415Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7617453Z graph_break [] 2025-12-04T09:31:25.7617530Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7617584Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7617679Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7618027Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7618065Z graph_break [] 2025-12-04T09:31:25.7618138Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7618224Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7618319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7618663Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7618701Z graph_break [] 2025-12-04T09:31:25.7618774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7618832Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7618927Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7619272Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7619310Z graph_break [] 2025-12-04T09:31:25.7619384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7619439Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7619535Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7619880Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7619918Z 
graph_break [] 2025-12-04T09:31:25.7619991Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7620050Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7620145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7620492Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7620529Z graph_break [] 2025-12-04T09:31:25.7620603Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7620659Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7620776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7621119Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7621158Z graph_break [] 2025-12-04T09:31:25.7621231Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7621288Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7621385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7621726Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7621764Z graph_break [] 2025-12-04T09:31:25.7621837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7621914Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7622010Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7622356Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7622393Z graph_break [] 2025-12-04T09:31:25.7622467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7622522Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7622620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7622963Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7623003Z graph_break [] 2025-12-04T09:31:25.7623075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7623131Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7623226Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7623573Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7623610Z graph_break [] 2025-12-04T09:31:25.7623683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7623740Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7623835Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7624184Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7624220Z graph_break [] 2025-12-04T09:31:25.7624294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7624351Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7624469Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7624812Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7624851Z graph_break [] 2025-12-04T09:31:25.7624925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7624980Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7625075Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7625420Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7625456Z graph_break [] 2025-12-04T09:31:25.7625530Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7625607Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7625703Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7626151Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7626188Z graph_break [] 2025-12-04T09:31:25.7626263Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7626320Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7626418Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7626763Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7626802Z graph_break [] 2025-12-04T09:31:25.7626875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7626933Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7627027Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7627374Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7627410Z graph_break [] 2025-12-04T09:31:25.7627484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7627538Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7627637Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7627980Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7628018Z graph_break [] 2025-12-04T09:31:25.7628090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7628151Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7628285Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7628632Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7628674Z graph_break [] 2025-12-04T09:31:25.7628749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7628807Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7628903Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7629252Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7629290Z graph_break [] 2025-12-04T09:31:25.7629368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7629423Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7629548Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7629893Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7629934Z graph_break [] 2025-12-04T09:31:25.7630008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7630070Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7630167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7630517Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7630560Z graph_break [] 2025-12-04T09:31:25.7630635Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7630694Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7630791Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7631137Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7631175Z graph_break [] 2025-12-04T09:31:25.7631252Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7631311Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7631412Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7631758Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7631799Z graph_break [] 2025-12-04T09:31:25.7631874Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7631932Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7632052Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7632399Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7632439Z graph_break [] 2025-12-04T09:31:25.7632516Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7632574Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7632674Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7633019Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7633062Z graph_break [] 2025-12-04T09:31:25.7633140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7633198Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7633324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7633668Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7633709Z graph_break [] 2025-12-04T09:31:25.7633783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7633843Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7633942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7634289Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7634329Z graph_break [] 2025-12-04T09:31:25.7634406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7634462Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7634561Z aot_autograd [('total', 1), ('autograd_cache_miss', 
1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7634907Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7634950Z graph_break [] 2025-12-04T09:31:25.7635024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7635083Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7635181Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7635528Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7635569Z graph_break [] 2025-12-04T09:31:25.7635644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7635705Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7635801Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7664938Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7665053Z graph_break [] 2025-12-04T09:31:25.7665163Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7665215Z Traceback (most recent call last): 2025-12-04T09:31:25.7665360Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7665400Z self.common( 2025-12-04T09:31:25.7665503Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7665551Z return func(*args, **kwds) 2025-12-04T09:31:25.7665685Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7665725Z check_model( 2025-12-04T09:31:25.7665851Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7665890Z assert_equal_fn( 2025-12-04T09:31:25.7666192Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7666257Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7666424Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7666501Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7666559Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7666563Z 2025-12-04T09:31:25.7666610Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7666720Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7666826Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7666828Z 2025-12-04T09:31:25.7666878Z The failure occurred for item [2] 2025-12-04T09:31:25.7666881Z 2025-12-04T09:31:25.7666959Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7667113Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7667115Z 2025-12-04T09:31:25.7667209Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7667291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7667354Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7667460Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7667822Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7667862Z graph_break [] 2025-12-04T09:31:25.7667945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7668007Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7668110Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7668464Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7668505Z graph_break [] 2025-12-04T09:31:25.7668581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7668681Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7668780Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7669128Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7669166Z graph_break [] 2025-12-04T09:31:25.7669242Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7669300Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7669401Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7669749Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7669787Z graph_break [] 2025-12-04T09:31:25.7669891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7669948Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7670048Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7670396Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7670437Z graph_break [] 2025-12-04T09:31:25.7670512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7670575Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7670673Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7671019Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7671058Z graph_break [] 2025-12-04T09:31:25.7671135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7671191Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7671290Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7671635Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7671676Z graph_break [] 2025-12-04T09:31:25.7671752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7671812Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7671908Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7672254Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7672295Z graph_break [] 2025-12-04T09:31:25.7672369Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7672454Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7672551Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7672904Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7672944Z graph_break [] 2025-12-04T09:31:25.7673021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7673076Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7673177Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7673526Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7673570Z graph_break [] 2025-12-04T09:31:25.7673664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7673722Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7673818Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7674164Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7674202Z graph_break [] 2025-12-04T09:31:25.7674278Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7674335Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7674435Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7674784Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7674824Z graph_break [] 2025-12-04T09:31:25.7674901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7674955Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7675054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7675398Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7675439Z graph_break [] 2025-12-04T09:31:25.7675512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7675574Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7675671Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7676062Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7676100Z graph_break [] 2025-12-04T09:31:25.7676178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7676264Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7676365Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7676710Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7676753Z graph_break [] 2025-12-04T09:31:25.7676827Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7676885Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7676985Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7677330Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7677371Z graph_break [] 2025-12-04T09:31:25.7677444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7677545Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7677641Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7677986Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7678023Z graph_break [] 2025-12-04T09:31:25.7678100Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7678159Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7678258Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7678604Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7678646Z graph_break [] 2025-12-04T09:31:25.7678720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7678780Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7678876Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7679226Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7679267Z graph_break [] 2025-12-04T09:31:25.7679340Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7679399Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7679495Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7679839Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7679877Z graph_break [] 2025-12-04T09:31:25.7679953Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7680031Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7680131Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7680475Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7680517Z graph_break [] 2025-12-04T09:31:25.7680590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7680648Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7680744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7681092Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7681129Z graph_break [] 2025-12-04T09:31:25.7681205Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7681281Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7681380Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7681723Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7681764Z graph_break [] 2025-12-04T09:31:25.7681839Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7681898Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7681999Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7682342Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7682385Z graph_break [] 2025-12-04T09:31:25.7682458Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7682517Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7682614Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7682962Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7682999Z graph_break [] 2025-12-04T09:31:25.7683077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7683133Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7683232Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7683576Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7683616Z graph_break [] 2025-12-04T09:31:25.7683689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7683750Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7683867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7684213Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7684256Z graph_break [] 2025-12-04T09:31:25.7684329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7684389Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7684486Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7684835Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7684872Z graph_break [] 2025-12-04T09:31:25.7684948Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7685025Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7685124Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7685471Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7685511Z graph_break [] 2025-12-04T09:31:25.7685584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7685643Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7685741Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7686160Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7686199Z graph_break [] 2025-12-04T09:31:25.7686276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7686332Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7686432Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7686779Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7686817Z graph_break [] 2025-12-04T09:31:25.7686894Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7686951Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7687050Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7687395Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7687436Z graph_break [] 2025-12-04T09:31:25.7687509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7687570Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7687697Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7688044Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7688082Z graph_break [] 2025-12-04T09:31:25.7688158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7688212Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7688311Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7688656Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7688697Z 
graph_break [] 2025-12-04T09:31:25.7688771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7688861Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7688959Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7689305Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7689346Z graph_break [] 2025-12-04T09:31:25.7689420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7689478Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7689578Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7689925Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7689964Z graph_break [] 2025-12-04T09:31:25.7690041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7690099Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7690198Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7690541Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7690582Z graph_break [] 2025-12-04T09:31:25.7690655Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7690717Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7690813Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7691161Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7691201Z graph_break [] 2025-12-04T09:31:25.7691274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7691333Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7691457Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7691809Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7691848Z graph_break [] 2025-12-04T09:31:25.7691924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7691980Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7692079Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7692424Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7692464Z graph_break [] 2025-12-04T09:31:25.7692537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7692616Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7692712Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7693058Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7693095Z graph_break [] 2025-12-04T09:31:25.7693173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7693230Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7693332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7693677Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7693719Z graph_break [] 2025-12-04T09:31:25.7693795Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7693853Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7693955Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7694299Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7694339Z graph_break [] 2025-12-04T09:31:25.7694412Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7694473Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7694571Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7694917Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7694954Z graph_break [] 2025-12-04T09:31:25.7695039Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7695087Z Traceback (most recent call last): 2025-12-04T09:31:25.7695242Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7695282Z self.common( 2025-12-04T09:31:25.7695376Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7695423Z return func(*args, **kwds) 2025-12-04T09:31:25.7695554Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:31:25.7695592Z check_model( 2025-12-04T09:31:25.7695712Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7695752Z assert_equal_fn( 2025-12-04T09:31:25.7695897Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7696031Z return super().assertEqual(x, y, *args, **kwargs) 
2025-12-04T09:31:25.7696200Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7696277Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7696338Z AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7696341Z 2025-12-04T09:31:25.7696388Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7696519Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7696615Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7696617Z 2025-12-04T09:31:25.7696668Z The failure occurred for item [2] 2025-12-04T09:31:25.7696670Z 2025-12-04T09:31:25.7696748Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7696899Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7696902Z 2025-12-04T09:31:25.7696995Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7697073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7697133Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7697233Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7697585Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7697624Z graph_break [] 2025-12-04T09:31:25.7697702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7697767Z stats [('calls_captured', 12), ('unique_graphs', 2)] 
2025-12-04T09:31:25.7697870Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7698218Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7698262Z graph_break [] 2025-12-04T09:31:25.7698337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7698398Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7698495Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7698846Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7698887Z graph_break [] 2025-12-04T09:31:25.7698991Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7699053Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7699150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7699498Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7699536Z graph_break [] 2025-12-04T09:31:25.7699614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7699671Z stats 
[('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7699771Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7700117Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7700180Z graph_break [] 2025-12-04T09:31:25.7700254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7700314Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7700412Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7700759Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7700797Z graph_break [] 2025-12-04T09:31:25.7700877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7700933Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7701033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7701380Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7701422Z graph_break [] 2025-12-04T09:31:25.7701499Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7701554Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7701654Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7702003Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7702046Z graph_break [] 2025-12-04T09:31:25.7702121Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7702182Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7702280Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7702632Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7702670Z graph_break [] 2025-12-04T09:31:25.7702768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7702824Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7702924Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7703270Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7703312Z graph_break [] 2025-12-04T09:31:25.7703386Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7703445Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7703542Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7703890Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7703958Z graph_break [] 2025-12-04T09:31:25.7704034Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7704093Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7704189Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7704535Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7704573Z graph_break [] 2025-12-04T09:31:25.7704651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7704708Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7704808Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7705154Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7705196Z 
graph_break [] 2025-12-04T09:31:25.7705270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7705332Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7705428Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7705778Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7705816Z graph_break [] 2025-12-04T09:31:25.7705895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7706047Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7706148Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7706496Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7706535Z graph_break [] 2025-12-04T09:31:25.7706612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7706701Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7706803Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7707153Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7707195Z graph_break [] 2025-12-04T09:31:25.7707269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7707331Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7707427Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7707775Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7707813Z graph_break [] 2025-12-04T09:31:25.7707917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7707976Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7708077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7708422Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7708464Z graph_break [] 2025-12-04T09:31:25.7708538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7708601Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7708698Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7709057Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), 
('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7709101Z graph_break [] 2025-12-04T09:31:25.7709176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7709235Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7709332Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7709681Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7709719Z graph_break [] 2025-12-04T09:31:25.7709797Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7709854Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7709954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7710298Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7710340Z graph_break [] 2025-12-04T09:31:25.7710414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7710497Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7710595Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7710944Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7710987Z graph_break [] 2025-12-04T09:31:25.7711062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7711121Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7711218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7711573Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7711611Z graph_break [] 2025-12-04T09:31:25.7711709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7711766Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7711866Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7712212Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7712254Z graph_break [] 2025-12-04T09:31:25.7712327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7712387Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7712484Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7712832Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7712871Z graph_break [] 2025-12-04T09:31:25.7712952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7713009Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7713111Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7713465Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7713506Z graph_break [] 2025-12-04T09:31:25.7713584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7713643Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7713743Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7714089Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7714131Z graph_break [] 2025-12-04T09:31:25.7714204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7714286Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7714384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7714731Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7714772Z graph_break [] 2025-12-04T09:31:25.7714848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7714904Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7715004Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7715353Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7715395Z graph_break [] 2025-12-04T09:31:25.7715469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7715551Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7715649Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7716042Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7716084Z graph_break [] 2025-12-04T09:31:25.7716158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7716222Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7716320Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7716670Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7716711Z graph_break [] 2025-12-04T09:31:25.7716792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7716849Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7716954Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7717303Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7717348Z graph_break [] 2025-12-04T09:31:25.7717423Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7717489Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7717587Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7717942Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7717981Z graph_break [] 2025-12-04T09:31:25.7718060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7718142Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7718247Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:31:25.7718597Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7718639Z graph_break [] 2025-12-04T09:31:25.7718720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7718781Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7718884Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7719236Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7719283Z graph_break [] 2025-12-04T09:31:25.7719359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7719455Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7719554Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7719911Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7719951Z graph_break [] 2025-12-04T09:31:25.7720033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7720093Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7720199Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7720545Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7720592Z graph_break [] 2025-12-04T09:31:25.7720668Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7720733Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7720832Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7721183Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7721231Z graph_break [] 2025-12-04T09:31:25.7721306Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7721373Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7721471Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7721823Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7721863Z graph_break [] 2025-12-04T09:31:25.7721945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7722003Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7722128Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7722474Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7722521Z graph_break [] 2025-12-04T09:31:25.7722597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7722660Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7722759Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7723114Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7723159Z graph_break [] 2025-12-04T09:31:25.7723235Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7723324Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7723423Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7723774Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7723814Z graph_break [] 2025-12-04T09:31:25.7723897Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7723957Z stats [('calls_captured', 12), ('unique_graphs', 2)] 
2025-12-04T09:31:25.7724062Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7724413Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7724461Z graph_break [] 2025-12-04T09:31:25.7724537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7724601Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7724699Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7725050Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7725091Z graph_break [] 2025-12-04T09:31:25.7725173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7725235Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7725340Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7725686Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7725732Z graph_break [] 2025-12-04T09:31:25.7725823Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7725873Z Traceback 
(most recent call last):
2025-12-04T09:31:25.7726068Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:31:25.7726111Z self.common(
2025-12-04T09:31:25.7726209Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:31:25.7726258Z return func(*args, **kwds)
2025-12-04T09:31:25.7726393Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu
2025-12-04T09:31:25.7726434Z check_model(
2025-12-04T09:31:25.7726560Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:31:25.7726602Z assert_equal_fn(
2025-12-04T09:31:25.7726754Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:31:25.7726818Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:31:25.7726988Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:31:25.7727063Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:31:25.7727125Z AssertionError: Tensor-likes are not close!
2025-12-04T09:31:25.7727128Z
2025-12-04T09:31:25.7727203Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:31:25.7727308Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:31:25.7727406Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed)
2025-12-04T09:31:25.7727408Z
2025-12-04T09:31:25.7727462Z The failure occurred for item [2]
2025-12-04T09:31:25.7727464Z
2025-12-04T09:31:25.7727540Z To execute this test, run the following from the base repo dir:
2025-12-04T09:31:25.7727694Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda
2025-12-04T09:31:25.7727697Z
2025-12-04T09:31:25.7727789Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:31:25.7727871Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7727930Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:31:25.7728036Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:31:25.7728384Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:31:25.7728429Z graph_break []
2025-12-04T09:31:25.7728504Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:31:25.7728569Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:31:25.7728668Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:31:25.7729024Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7729073Z graph_break [] 2025-12-04T09:31:25.7729149Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7729213Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7729312Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7729663Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7729703Z graph_break [] 2025-12-04T09:31:25.7729806Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7729866Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7729971Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7730318Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7730364Z graph_break [] 2025-12-04T09:31:25.7730440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7730505Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7730604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7730968Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7731037Z graph_break [] 2025-12-04T09:31:25.7731112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7731176Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7731275Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7731629Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7731669Z graph_break [] 2025-12-04T09:31:25.7731752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7731811Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7731916Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7732265Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7732312Z graph_break [] 2025-12-04T09:31:25.7732387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7732450Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7732549Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7732904Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7732945Z graph_break [] 2025-12-04T09:31:25.7733028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7733088Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7733192Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7733538Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7733583Z graph_break [] 2025-12-04T09:31:25.7733685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7733743Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7733848Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7734198Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7734244Z graph_break [] 2025-12-04T09:31:25.7734319Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7734383Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7734482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7734835Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7734903Z graph_break [] 2025-12-04T09:31:25.7734986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7735044Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7735148Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7735498Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7735545Z graph_break [] 2025-12-04T09:31:25.7735622Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7735685Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7735785Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7736189Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7736236Z graph_break [] 2025-12-04T09:31:25.7736310Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7736376Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7736475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7736829Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7736871Z graph_break [] 2025-12-04T09:31:25.7736952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7737012Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7737117Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7737465Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7737512Z graph_break [] 2025-12-04T09:31:25.7737616Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7737680Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7737778Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7738131Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7738170Z graph_break [] 2025-12-04T09:31:25.7738250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7738309Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7738412Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7738767Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7738832Z graph_break [] 2025-12-04T09:31:25.7738915Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7738974Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7739077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7739423Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7739469Z graph_break [] 2025-12-04T09:31:25.7739548Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7739613Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7739712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7740067Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7740105Z graph_break [] 2025-12-04T09:31:25.7740183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7740239Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7740338Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7740686Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7740729Z graph_break [] 2025-12-04T09:31:25.7740802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7740862Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7744968Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7745326Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7745369Z graph_break [] 2025-12-04T09:31:25.7745487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7745546Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7745648Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7746042Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7746085Z graph_break [] 2025-12-04T09:31:25.7746161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7746223Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7746322Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7746673Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7746745Z graph_break [] 2025-12-04T09:31:25.7746824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7746882Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7746982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7747327Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7747367Z graph_break [] 2025-12-04T09:31:25.7747444Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7747502Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7747597Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7747947Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7747985Z graph_break [] 2025-12-04T09:31:25.7748058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7748113Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7748209Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7748553Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7748590Z graph_break [] 2025-12-04T09:31:25.7748669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7748727Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7748825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7749168Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7749208Z graph_break [] 2025-12-04T09:31:25.7749282Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7749373Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7749470Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7749816Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7749852Z graph_break [] 2025-12-04T09:31:25.7749927Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7749983Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7750080Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7750426Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7750463Z graph_break [] 2025-12-04T09:31:25.7750562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7750616Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7750714Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7751058Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7751096Z graph_break [] 2025-12-04T09:31:25.7751168Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7751226Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7751321Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7751667Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7751704Z graph_break [] 2025-12-04T09:31:25.7751779Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7751833Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7751933Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7752281Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7752320Z graph_break [] 2025-12-04T09:31:25.7752393Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7752454Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7752550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7752894Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7752933Z graph_break [] 2025-12-04T09:31:25.7753004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7753089Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7753185Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7753528Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7753565Z 
graph_break [] 2025-12-04T09:31:25.7753640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7753697Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7753796Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7754141Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7754180Z graph_break [] 2025-12-04T09:31:25.7754275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7754332Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7754428Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7754771Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7754808Z graph_break [] 2025-12-04T09:31:25.7754882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7754941Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7755038Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7755384Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7755423Z graph_break [] 2025-12-04T09:31:25.7755498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7755554Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7755651Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7756050Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7756090Z graph_break [] 2025-12-04T09:31:25.7756164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7756222Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7756318Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7756668Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7756704Z graph_break [] 2025-12-04T09:31:25.7756779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7756861Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7756958Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7757300Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7757343Z graph_break [] 2025-12-04T09:31:25.7757414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7757473Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7757571Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7757916Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7757953Z graph_break [] 2025-12-04T09:31:25.7758027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7758114Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7758211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7758560Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7758596Z graph_break [] 2025-12-04T09:31:25.7758670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7758728Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7758825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7759170Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7759210Z graph_break [] 2025-12-04T09:31:25.7759282Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7759337Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7759432Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7759779Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7759817Z graph_break [] 2025-12-04T09:31:25.7759891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7759948Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7760048Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7760395Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7760430Z graph_break [] 2025-12-04T09:31:25.7760505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7760581Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7760680Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7761023Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7761062Z graph_break [] 2025-12-04T09:31:25.7761143Z _______________ GPUTests.test_var_mean_tile_reduction_True_cuda ________________ 2025-12-04T09:31:25.7761192Z Traceback (most recent call last): 2025-12-04T09:31:25.7761325Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:31:25.7761365Z self.common( 2025-12-04T09:31:25.7761457Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:31:25.7761508Z return func(*args, **kwds) 2025-12-04T09:31:25.7761638Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:31:25.7761677Z check_model( 2025-12-04T09:31:25.7761792Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:31:25.7761856Z assert_equal_fn( 2025-12-04T09:31:25.7761998Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:31:25.7762062Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:31:25.7762223Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:31:25.7762298Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:31:25.7762352Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7762355Z 2025-12-04T09:31:25.7762401Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7762506Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7762613Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7762617Z 2025-12-04T09:31:25.7762664Z The failure occurred for item [2] 2025-12-04T09:31:25.7762666Z 2025-12-04T09:31:25.7762743Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7762896Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7762900Z 2025-12-04T09:31:25.7762988Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7763064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7763120Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7763219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7763565Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7763605Z graph_break [] 2025-12-04T09:31:25.7763678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7763739Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7763834Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7764179Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), 
('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7764236Z graph_break [] 2025-12-04T09:31:25.7764311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7764368Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7764467Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7764810Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7764850Z graph_break [] 2025-12-04T09:31:25.7764923Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7764982Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7765078Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7765425Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7765488Z graph_break [] 2025-12-04T09:31:25.7765560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7765618Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7765714Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7766143Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7766182Z graph_break [] 2025-12-04T09:31:25.7766256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7766311Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7766410Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7766754Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7766793Z graph_break [] 2025-12-04T09:31:25.7766864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7766919Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7767016Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7767369Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7767406Z graph_break [] 2025-12-04T09:31:25.7767481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7767535Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7767631Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7767974Z inductor [('triton_bundler_save_kernel', 40), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7768009Z graph_break [] 2025-12-04T09:31:25.7768118Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7768174Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7768273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7768615Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7768652Z graph_break [] 2025-12-04T09:31:25.7768724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7768781Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7768876Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7769224Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7769288Z graph_break [] 2025-12-04T09:31:25.7769361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7769415Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7769511Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7769852Z inductor 
[('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7769890Z graph_break [] 2025-12-04T09:31:25.7769963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7770020Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7770114Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7770459Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7770498Z graph_break [] 2025-12-04T09:31:25.7770570Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7770624Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7770719Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7771064Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7771101Z graph_break [] 2025-12-04T09:31:25.7771176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7771231Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7771330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:31:25.7771673Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7771711Z graph_break [] 2025-12-04T09:31:25.7771828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7771886Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7771981Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7772329Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7772365Z graph_break [] 2025-12-04T09:31:25.7772440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7772493Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7772590Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7772936Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7772996Z graph_break [] 2025-12-04T09:31:25.7773071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7773127Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7773223Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7773569Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7773612Z graph_break [] 2025-12-04T09:31:25.7773685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7773743Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7773838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7774184Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7774220Z graph_break [] 2025-12-04T09:31:25.7774294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7774348Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7774445Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7774792Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7774830Z graph_break [] 2025-12-04T09:31:25.7774904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7774958Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7775054Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7775402Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7775440Z graph_break [] 2025-12-04T09:31:25.7775532Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7775590Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7775684Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7776094Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7776130Z graph_break [] 2025-12-04T09:31:25.7776204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7776258Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7776354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7776699Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7776766Z graph_break [] 2025-12-04T09:31:25.7776838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7776895Z stats [('calls_captured', 6), ('unique_graphs', 1)] 
2025-12-04T09:31:25.7776989Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7777334Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7777371Z graph_break [] 2025-12-04T09:31:25.7777445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7777502Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7777597Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7777950Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7777985Z graph_break [] 2025-12-04T09:31:25.7778059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7778114Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7778211Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7778553Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7778592Z graph_break [] 2025-12-04T09:31:25.7778664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7778719Z stats 
[('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7778813Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7779156Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7779191Z graph_break [] 2025-12-04T09:31:25.7779295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7779351Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7779447Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7779797Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7779833Z graph_break [] 2025-12-04T09:31:25.7779907Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7779963Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7780061Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7780404Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7780472Z graph_break [] 2025-12-04T09:31:25.7780544Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:31:25.7780600Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7780695Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7781040Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7781078Z graph_break [] 2025-12-04T09:31:25.7781153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7781206Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7781303Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7781647Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7781689Z graph_break [] 2025-12-04T09:31:25.7781764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7781820Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7781915Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7782263Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7782303Z graph_break [] 2025-12-04T09:31:25.7782375Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7782432Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7782526Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7782871Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7782909Z graph_break [] 2025-12-04T09:31:25.7783009Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7783066Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7783163Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7783507Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7783546Z graph_break [] 2025-12-04T09:31:25.7783618Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7783673Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7783770Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7784122Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7784179Z 
graph_break [] 2025-12-04T09:31:25.7784263Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7784319Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7784417Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7784761Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7784796Z graph_break [] 2025-12-04T09:31:25.7784872Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7784928Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7785023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7785379Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7785415Z graph_break [] 2025-12-04T09:31:25.7785489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7785544Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7785640Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7786039Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), 
('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7786077Z graph_break [] 2025-12-04T09:31:25.7786150Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7786206Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7786302Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7786646Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7786683Z graph_break [] 2025-12-04T09:31:25.7786788Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7786845Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7786941Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7787286Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7787321Z graph_break [] 2025-12-04T09:31:25.7787393Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7787446Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7787542Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7787884Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), 
('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7787953Z graph_break [] 2025-12-04T09:31:25.7788026Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7788080Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7788175Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7788518Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7788555Z graph_break [] 2025-12-04T09:31:25.7788626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7788685Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7788779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7789126Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7789162Z graph_break [] 2025-12-04T09:31:25.7789234Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7789290Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7789386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7789729Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7789766Z graph_break [] 2025-12-04T09:31:25.7789838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7789892Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7789986Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7790334Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7790370Z graph_break [] 2025-12-04T09:31:25.7790442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7790519Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7790614Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7790960Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7790996Z graph_break [] 2025-12-04T09:31:25.7791067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7791121Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7791216Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7791559Z inductor [('triton_bundler_save_kernel', 80), 
('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7791596Z graph_break [] 2025-12-04T09:31:25.7791687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7791744Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:31:25.7791838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:31:25.7792182Z inductor [('triton_bundler_save_kernel', 80), ('benchmarking.InductorBenchmarker.benchmark', 10), ('benchmarking.InductorBenchmarker.benchmark_gpu', 10), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:31:25.7792216Z graph_break [] 2025-12-04T09:31:25.7792291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:31:25.7792345Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:31:25.7792443Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:31:25.7792788Z inductor [('triton_bundler_save_kernel', 40), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:31:25.7792825Z graph_break [] 2025-12-04T09:31:25.7793053Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml - 2025-12-04T09:31:25.7793113Z =========================== short test summary info ============================ 2025-12-04T09:31:25.7793329Z FAILED [0.5829s] 
inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7793333Z 2025-12-04T09:31:25.7793380Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7793482Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7793584Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7793586Z 2025-12-04T09:31:25.7793631Z The failure occurred for item [2] 2025-12-04T09:31:25.7793633Z 2025-12-04T09:31:25.7793706Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7793853Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7793855Z 2025-12-04T09:31:25.7793943Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7794120Z FAILED [0.5071s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7794144Z 2025-12-04T09:31:25.7794188Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7794284Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7794376Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7794380Z 2025-12-04T09:31:25.7794425Z The failure occurred for item [2] 2025-12-04T09:31:25.7794427Z 2025-12-04T09:31:25.7794502Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7794648Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7794650Z 2025-12-04T09:31:25.7794735Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7794907Z FAILED [0.2440s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7794909Z 2025-12-04T09:31:25.7794954Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7795052Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7795152Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7795195Z 2025-12-04T09:31:25.7795238Z The failure occurred for item [2] 2025-12-04T09:31:25.7795240Z 2025-12-04T09:31:25.7795313Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7795456Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7795458Z 2025-12-04T09:31:25.7795542Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7795713Z FAILED [0.2472s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7795715Z 2025-12-04T09:31:25.7795761Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7795858Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7796004Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7796008Z 2025-12-04T09:31:25.7796052Z The failure occurred for item [2] 2025-12-04T09:31:25.7796054Z 2025-12-04T09:31:25.7796124Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7796268Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7796270Z 2025-12-04T09:31:25.7796353Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7796527Z FAILED [0.2453s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7796530Z 2025-12-04T09:31:25.7796572Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7796673Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7796774Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7796777Z 2025-12-04T09:31:25.7796822Z The failure occurred for item [2] 2025-12-04T09:31:25.7796824Z 2025-12-04T09:31:25.7796895Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7797039Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7797041Z 2025-12-04T09:31:25.7797127Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7797300Z FAILED [0.2768s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7797302Z 2025-12-04T09:31:25.7797346Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7797474Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7797572Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7797576Z 2025-12-04T09:31:25.7797618Z The failure occurred for item [2] 2025-12-04T09:31:25.7797619Z 2025-12-04T09:31:25.7797690Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7797832Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7797834Z 2025-12-04T09:31:25.7797919Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7798090Z FAILED [0.2506s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7798094Z 2025-12-04T09:31:25.7798137Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7798236Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7798335Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7798337Z 2025-12-04T09:31:25.7798408Z The failure occurred for item [2] 2025-12-04T09:31:25.7798409Z 2025-12-04T09:31:25.7798481Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7798624Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7798628Z 2025-12-04T09:31:25.7798711Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7798884Z FAILED [0.2686s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7798886Z 2025-12-04T09:31:25.7798927Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7799025Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7799122Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7799124Z 2025-12-04T09:31:25.7799168Z The failure occurred for item [2] 2025-12-04T09:31:25.7799172Z 2025-12-04T09:31:25.7799242Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7799385Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7799387Z 2025-12-04T09:31:25.7799470Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7799641Z FAILED [0.2526s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7799643Z 2025-12-04T09:31:25.7799685Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7799783Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7799880Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7799882Z 2025-12-04T09:31:25.7799925Z The failure occurred for item [2] 2025-12-04T09:31:25.7799929Z 2025-12-04T09:31:25.7799999Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7800145Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7800147Z 2025-12-04T09:31:25.7800230Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7800400Z FAILED [0.2467s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7800402Z 2025-12-04T09:31:25.7800444Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7800541Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7800660Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7800662Z 2025-12-04T09:31:25.7800704Z The failure occurred for item [2] 2025-12-04T09:31:25.7800708Z 2025-12-04T09:31:25.7800780Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7800924Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7800926Z 2025-12-04T09:31:25.7801010Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7801182Z FAILED [0.2504s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7801184Z 2025-12-04T09:31:25.7801229Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7801325Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7801428Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7801430Z 2025-12-04T09:31:25.7801473Z The failure occurred for item [2] 2025-12-04T09:31:25.7801478Z 2025-12-04T09:31:25.7801576Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7801719Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7801721Z 2025-12-04T09:31:25.7801804Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7801978Z FAILED [0.2496s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7801980Z 2025-12-04T09:31:25.7802023Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7802119Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7802218Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7802219Z 2025-12-04T09:31:25.7802263Z The failure occurred for item [2] 2025-12-04T09:31:25.7802265Z 2025-12-04T09:31:25.7802335Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7802481Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7802483Z 2025-12-04T09:31:25.7802566Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7802739Z FAILED [0.2445s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7802741Z 2025-12-04T09:31:25.7802784Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7802883Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7802983Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7802988Z 2025-12-04T09:31:25.7803029Z The failure occurred for item [2] 2025-12-04T09:31:25.7803031Z 2025-12-04T09:31:25.7803106Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7803249Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7803251Z 2025-12-04T09:31:25.7803340Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7803513Z FAILED [0.5056s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7803515Z 2025-12-04T09:31:25.7803559Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7803654Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7803776Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7803778Z 2025-12-04T09:31:25.7803823Z The failure occurred for item [2] 2025-12-04T09:31:25.7803825Z 2025-12-04T09:31:25.7803898Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7804044Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7804046Z 2025-12-04T09:31:25.7804135Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7804307Z FAILED [0.2761s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7804310Z 2025-12-04T09:31:25.7804352Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7804450Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7804549Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7804552Z 2025-12-04T09:31:25.7804595Z The failure occurred for item [2] 2025-12-04T09:31:25.7804597Z 2025-12-04T09:31:25.7804668Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7804830Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7804832Z 2025-12-04T09:31:25.7804915Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7805089Z FAILED [0.2430s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7805091Z 2025-12-04T09:31:25.7805133Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7805229Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7805325Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7805327Z 2025-12-04T09:31:25.7805373Z The failure occurred for item [2] 2025-12-04T09:31:25.7805375Z 2025-12-04T09:31:25.7805446Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7805589Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7805593Z 2025-12-04T09:31:25.7805679Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7805851Z FAILED [0.5208s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7805853Z 2025-12-04T09:31:25.7805896Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7806018Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7806111Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7806113Z 2025-12-04T09:31:25.7806157Z The failure occurred for item [2] 2025-12-04T09:31:25.7806159Z 2025-12-04T09:31:25.7806231Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7806372Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7806376Z 2025-12-04T09:31:25.7806462Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7806632Z FAILED [0.5114s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7806634Z 2025-12-04T09:31:25.7806679Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7806769Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7806859Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7806861Z 2025-12-04T09:31:25.7806903Z The failure occurred for item [2] 2025-12-04T09:31:25.7806955Z 2025-12-04T09:31:25.7807028Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7807171Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7807175Z 2025-12-04T09:31:25.7807258Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7807432Z FAILED [0.2585s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7807433Z 2025-12-04T09:31:25.7807477Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7807577Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7807675Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7807677Z 2025-12-04T09:31:25.7807720Z The failure occurred for item [2] 2025-12-04T09:31:25.7807721Z 2025-12-04T09:31:25.7807792Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7807940Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7807972Z 2025-12-04T09:31:25.7808056Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7808228Z FAILED [0.6904s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7808231Z 2025-12-04T09:31:25.7808273Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7808370Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7808468Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7808473Z 2025-12-04T09:31:25.7808516Z The failure occurred for item [2] 2025-12-04T09:31:25.7808518Z 2025-12-04T09:31:25.7808592Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7808735Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7808739Z 2025-12-04T09:31:25.7808824Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7808994Z FAILED [0.2427s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7808996Z 2025-12-04T09:31:25.7809041Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7809137Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7809235Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7809237Z 2025-12-04T09:31:25.7809279Z The failure occurred for item [2] 2025-12-04T09:31:25.7809280Z 2025-12-04T09:31:25.7809354Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7809495Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7809497Z 2025-12-04T09:31:25.7809586Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7809756Z FAILED [0.2631s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7809758Z 2025-12-04T09:31:25.7809802Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7809898Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7809998Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7810000Z 2025-12-04T09:31:25.7810043Z The failure occurred for item [2] 2025-12-04T09:31:25.7810045Z 2025-12-04T09:31:25.7810143Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7810288Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7810290Z 2025-12-04T09:31:25.7810373Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7810544Z FAILED [0.2497s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7810545Z 2025-12-04T09:31:25.7810587Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7810686Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7810784Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7810786Z 2025-12-04T09:31:25.7810830Z The failure occurred for item [2] 2025-12-04T09:31:25.7810832Z 2025-12-04T09:31:25.7810901Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7811048Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7811050Z 2025-12-04T09:31:25.7811133Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7811329Z FAILED [0.5193s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7811331Z 2025-12-04T09:31:25.7811374Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7811467Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7811562Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7811564Z 2025-12-04T09:31:25.7811609Z The failure occurred for item [2] 2025-12-04T09:31:25.7811610Z 2025-12-04T09:31:25.7811683Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7811828Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7811830Z 2025-12-04T09:31:25.7811916Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7812092Z FAILED [0.2388s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7812094Z 2025-12-04T09:31:25.7812138Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7812234Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7812332Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7812334Z 2025-12-04T09:31:25.7812377Z The failure occurred for item [2] 2025-12-04T09:31:25.7812379Z 2025-12-04T09:31:25.7812451Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7812596Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7812600Z 2025-12-04T09:31:25.7812684Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7812860Z FAILED [0.5305s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7812861Z 2025-12-04T09:31:25.7812904Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7812996Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7813087Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7813089Z 2025-12-04T09:31:25.7813132Z The failure occurred for item [2] 2025-12-04T09:31:25.7813134Z 2025-12-04T09:31:25.7813205Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7813372Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7813375Z 2025-12-04T09:31:25.7813457Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7813629Z FAILED [0.4855s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7813633Z 2025-12-04T09:31:25.7813675Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7813767Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7813856Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7813857Z 2025-12-04T09:31:25.7813902Z The failure occurred for item [2] 2025-12-04T09:31:25.7813904Z 2025-12-04T09:31:25.7813974Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7814119Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7814121Z 2025-12-04T09:31:25.7814205Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7814376Z FAILED [0.2603s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7814409Z 2025-12-04T09:31:25.7814453Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7814550Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7814649Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7814651Z 2025-12-04T09:31:25.7814693Z The failure occurred for item [2] 2025-12-04T09:31:25.7814695Z 2025-12-04T09:31:25.7814768Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7814910Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7814913Z 2025-12-04T09:31:25.7814999Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7815168Z FAILED [0.2563s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7815172Z 2025-12-04T09:31:25.7815216Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7815313Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7815413Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7815415Z 2025-12-04T09:31:25.7815457Z The failure occurred for item [2] 2025-12-04T09:31:25.7815459Z 2025-12-04T09:31:25.7815530Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7815675Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7815677Z 2025-12-04T09:31:25.7815761Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7815971Z FAILED [0.5035s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7815977Z 2025-12-04T09:31:25.7816019Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7816112Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7816204Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7816206Z 2025-12-04T09:31:25.7816252Z The failure occurred for item [2] 2025-12-04T09:31:25.7816254Z 2025-12-04T09:31:25.7816325Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7816471Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7816473Z 2025-12-04T09:31:25.7816589Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7816763Z FAILED [0.2502s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7816776Z 2025-12-04T09:31:25.7816821Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7816919Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7817016Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7817019Z 2025-12-04T09:31:25.7817063Z The failure occurred for item [2] 2025-12-04T09:31:25.7817064Z 2025-12-04T09:31:25.7817138Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7817279Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7817281Z 2025-12-04T09:31:25.7817368Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7817538Z FAILED [0.4919s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7817568Z 2025-12-04T09:31:25.7817611Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7817703Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7817794Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7817796Z 2025-12-04T09:31:25.7817837Z The failure occurred for item [2] 2025-12-04T09:31:25.7817839Z 2025-12-04T09:31:25.7817910Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7818052Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7818054Z 2025-12-04T09:31:25.7818141Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7818315Z FAILED [0.5330s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 2025-12-04T09:31:25.7818318Z 2025-12-04T09:31:25.7818361Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7818455Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7818546Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:31:25.7818547Z 2025-12-04T09:31:25.7818593Z The failure occurred for item [2] 2025-12-04T09:31:25.7818594Z 2025-12-04T09:31:25.7818665Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7818810Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7818812Z 2025-12-04T09:31:25.7818893Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7819066Z FAILED [0.2144s] inductor/test_torchinductor.py::GPUTests::test_var_mean_tile_reduction_True_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:31:25.7819068Z 2025-12-04T09:31:25.7819109Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:31:25.7819212Z Greatest absolute difference: 0.5851404070854187 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:31:25.7819310Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:31:25.7819312Z 2025-12-04T09:31:25.7819357Z The failure occurred for item [2] 2025-12-04T09:31:25.7819359Z 2025-12-04T09:31:25.7819429Z To execute this test, run the following from the base repo dir: 2025-12-04T09:31:25.7819573Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor.py GPUTests.test_var_mean_tile_reduction_True_cuda 2025-12-04T09:31:25.7819575Z 2025-12-04T09:31:25.7819660Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:31:25.7819754Z ======================= 34 failed, 216 passed in 57.63s ======================== 2025-12-04T09:31:25.7819756Z 2025-12-04T09:31:25.7819931Z FINISHED PRINTING LOG FILE of inductor/test_torchinductor 2/2 (test/test-reports/inductor.test_torchinductor_2.2_916af9a5c16d1706_.log) 2025-12-04T09:31:25.7819935Z 2025-12-04T09:31:25.7820053Z Finished inductor/test_torchinductor 2/2 ... [2025-12-04 09:31:25.693959][5634706.200344313], took 1.12min 2025-12-04T09:31:25.7820295Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:31:27.0448944Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:31:27.0457203Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T09:31:27.0457462Z Uploading artifacts took 0.00 seconds 2025-12-04T09:31:27.0496196Z inductor/test_torchinductor 2/2 failed! 2025-12-04T09:31:27.0496466Z Running inductor/test_torchinductor_dynamic_shapes 3/4 ... 
[2025-12-04 09:31:27.045889][5634707.552287904] 2025-12-04T09:31:27.0496705Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:31:27.0497197Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_dynamic_shapes.py', '--shard-id=3', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:31:27.046244] 2025-12-04T09:33:20.7826924Z 2025-12-04T09:33:20.7828697Z PRINTING LOG FILE of inductor/test_torchinductor_dynamic_shapes 3/4 (test/test-reports/inductor.test_torchinductor_dynamic_shapes_3.4_a03110cdf8deed71_.log) 2025-12-04T09:33:20.7829494Z Test results will be stored in test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-da9ca06781626cfc.xml 2025-12-04T09:33:20.7829963Z ============================= test session starts ============================== 2025-12-04T09:33:20.7833446Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:33:20.7833727Z cachedir: .pytest_cache 2025-12-04T09:33:20.7834037Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:33:20.7834313Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T09:33:20.7834449Z configfile: pytest.ini 2025-12-04T09:33:20.7834718Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:33:20.7835008Z collecting ... 
collected 1851 items 2025-12-04T09:33:20.7835172Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:33:20.7853158Z Running 100 items in this shard: test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, 
test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, 
test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, 
test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, 
test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, 
test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, 
test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.7870999Z 2025-12-04T09:33:20.7871283Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [2.0053s] [ 1%] 2025-12-04T09:33:20.7871805Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [1.0620s] [ 2%] 2025-12-04T09:33:20.7872319Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8302s] [ 2%] 2025-12-04T09:33:20.7872905Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- 
test/inductor/test_torchinductor.py PASSED [0.8982s] [ 2%] 2025-12-04T09:33:20.7873415Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8381s] [ 2%] 2025-12-04T09:33:20.7873929Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0621s] [ 2%] 2025-12-04T09:33:20.7874437Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8242s] [ 2%] 2025-12-04T09:33:20.7874945Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7914s] [ 2%] 2025-12-04T09:33:20.7875452Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8394s] [ 2%] 2025-12-04T09:33:20.7875994Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7787s] [ 2%] 2025-12-04T09:33:20.7876568Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7956s] [ 2%] 2025-12-04T09:33:20.7877075Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8290s] [ 2%] 2025-12-04T09:33:20.7877579Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8303s] [ 2%] 2025-12-04T09:33:20.7878091Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8170s] [ 2%] 2025-12-04T09:33:20.7878601Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8665s] [ 2%] 2025-12-04T09:33:20.7879109Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.8144s] [ 2%] 2025-12-04T09:33:20.7879624Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7787s] [ 2%] 2025-12-04T09:33:20.7880134Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6579s] [ 2%] 2025-12-04T09:33:20.7880649Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6684s] [ 2%] 2025-12-04T09:33:20.7881157Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6664s] [ 2%] 2025-12-04T09:33:20.7881669Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6131s] [ 2%] 2025-12-04T09:33:20.7882174Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6474s] [ 2%] 2025-12-04T09:33:20.7882688Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- 
test/inductor/test_torchinductor.py PASSED [0.6548s] [ 2%] 2025-12-04T09:33:20.7883245Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6218s] [ 2%] 2025-12-04T09:33:20.7883748Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6465s] [ 2%] 2025-12-04T09:33:20.7884253Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.0621s] [ 2%] 2025-12-04T09:33:20.7884760Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6605s] [ 2%] 2025-12-04T09:33:20.7885269Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6188s] [ 2%] 2025-12-04T09:33:20.7885777Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5798s] [ 2%] 2025-12-04T09:33:20.7886324Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6517s] [ 2%] 2025-12-04T09:33:20.7886885Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6517s] [ 2%] 2025-12-04T09:33:20.7887390Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6190s] [ 2%] 2025-12-04T09:33:20.7887897Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6674s] [ 2%] 2025-12-04T09:33:20.7888403Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6120s] [ 2%] 2025-12-04T09:33:20.7888913Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6397s] [ 2%] 2025-12-04T09:33:20.7889426Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6490s] [ 2%] 2025-12-04T09:33:20.7889939Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6740s] [ 2%] 2025-12-04T09:33:20.7890447Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6312s] [ 2%] 2025-12-04T09:33:20.7890956Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.5891s] [ 2%] 2025-12-04T09:33:20.7891469Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6017s] [ 2%] 2025-12-04T09:33:20.7891981Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6303s] [ 2%] 2025-12-04T09:33:20.7892488Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- 
test/inductor/test_torchinductor.py PASSED [0.6202s] [ 2%] 2025-12-04T09:33:20.7892993Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6538s] [ 2%] 2025-12-04T09:33:20.7893549Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6373s] [ 2%] 2025-12-04T09:33:20.7894054Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6329s] [ 2%] 2025-12-04T09:33:20.7894554Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6511s] [ 2%] 2025-12-04T09:33:20.7895060Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.1733s] [ 2%] 2025-12-04T09:33:20.7895566Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.7035s] [ 2%] 2025-12-04T09:33:20.7896168Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6467s] [ 2%] 2025-12-04T09:33:20.7896711Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6371s] [ 2%] 2025-12-04T09:33:20.7897215Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout_deterministic_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [0.6511s] [ 2%] 2025-12-04T09:33:20.7897728Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3389s] [ 2%] 2025-12-04T09:33:20.7898244Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [1.3389s] [ 2%] 2025-12-04T09:33:20.7898756Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3277s] [ 2%] 2025-12-04T09:33:20.7899273Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3385s] [ 2%] 2025-12-04T09:33:20.7899785Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [0.6341s] [ 2%] 2025-12-04T09:33:20.7900299Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2265s] [ 2%] 2025-12-04T09:33:20.7900813Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3158s] [ 2%] 2025-12-04T09:33:20.7901328Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2868s] [ 2%] 2025-12-04T09:33:20.7901845Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [1.6697s] [ 2%] 2025-12-04T09:33:20.7902391Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [0.6472s] [ 2%] 2025-12-04T09:33:20.7902900Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3787s] [ 2%] 2025-12-04T09:33:20.7903458Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2839s] [ 2%] 2025-12-04T09:33:20.7903975Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2920s] [ 2%] 2025-12-04T09:33:20.7904490Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2851s] [ 2%] 2025-12-04T09:33:20.7905004Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3220s] [ 2%] 2025-12-04T09:33:20.7905517Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3090s] [ 2%] 2025-12-04T09:33:20.7906123Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3107s] [ 2%] 2025-12-04T09:33:20.7906680Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3007s] [ 2%] 2025-12-04T09:33:20.7907189Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2944s] [ 2%] 2025-12-04T09:33:20.7907702Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3001s] [ 2%] 2025-12-04T09:33:20.7908217Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.6299s] [ 2%] 2025-12-04T09:33:20.7908731Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [0.6402s] [ 2%] 2025-12-04T09:33:20.7909243Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [1.2823s] [ 2%] 2025-12-04T09:33:20.7909758Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3265s] [ 2%] 2025-12-04T09:33:20.7910271Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3528s] [ 2%] 2025-12-04T09:33:20.7910783Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3426s] [ 2%] 2025-12-04T09:33:20.7911299Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2912s] [ 2%] 2025-12-04T09:33:20.7911819Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3202s] [ 2%] 2025-12-04T09:33:20.7912330Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3967s] [ 2%] 2025-12-04T09:33:20.7912841Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3376s] [ 2%] 2025-12-04T09:33:20.7913389Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3452s] [ 2%] 2025-12-04T09:33:20.7913906Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2568s] [ 2%] 2025-12-04T09:33:20.7914424Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.4342s] [ 2%] 2025-12-04T09:33:20.7914933Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.8312s] [ 2%] 2025-12-04T09:33:20.7915442Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5084s] [ 2%] 2025-12-04T09:33:20.7916023Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py FAILED [0.8177s] [ 2%] 2025-12-04T09:33:20.7916569Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.6291s] [ 2%] 2025-12-04T09:33:20.7917087Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5913s] [ 2%] 2025-12-04T09:33:20.7917598Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5179s] [ 2%] 2025-12-04T09:33:20.7918111Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.5715s] [ 2%] 2025-12-04T09:33:20.7918622Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.6143s] [ 2%] 2025-12-04T09:33:20.7919134Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.4622s] [ 2%] 2025-12-04T09:33:20.7919647Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3423s] [ 2%] 2025-12-04T09:33:20.7920166Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3405s] [ 2%] 2025-12-04T09:33:20.7920678Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2844s] [ 2%] 2025-12-04T09:33:20.7921189Z 
inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2879s] [ 2%] 2025-12-04T09:33:20.7921706Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.9533s] [ 2%] 2025-12-04T09:33:20.7922217Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.3640s] [ 2%] 2025-12-04T09:33:20.7922729Z inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda <- test/inductor/test_torchinductor.py PASSED [1.2994s] [ 2%] 2025-12-04T09:33:20.7923009Z 2025-12-04T09:33:20.7923107Z =================================== FAILURES =================================== 2025-12-04T09:33:20.7923319Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __ 2025-12-04T09:33:20.7923515Z Traceback (most recent call last): 2025-12-04T09:33:20.7923736Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:33:20.7923947Z self.common( 2025-12-04T09:33:20.7924114Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:33:20.7924291Z return func(*args, **kwds) 2025-12-04T09:33:20.7924500Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:33:20.7924709Z check_model( 2025-12-04T09:33:20.7924892Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:33:20.7925097Z assert_equal_fn( 2025-12-04T09:33:20.7925310Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:33:20.7925558Z return 
super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:33:20.7925822Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:33:20.7966642Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:33:20.7966842Z AssertionError: Tensor-likes are not close! 2025-12-04T09:33:20.7966935Z 2025-12-04T09:33:20.7966989Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.7967180Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.7967429Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.7967571Z 2025-12-04T09:33:20.7967624Z The failure occurred for item [2] 2025-12-04T09:33:20.7967704Z 2025-12-04T09:33:20.7967806Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.7968135Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.7968399Z 2025-12-04T09:33:20.7968499Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.7968717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7968898Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.7969100Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.7969604Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.7970062Z graph_break [] 2025-12-04T09:33:20.7970230Z _ 
DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __ 2025-12-04T09:33:20.7970422Z Traceback (most recent call last): 2025-12-04T09:33:20.7970634Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:33:20.7970846Z self.common( 2025-12-04T09:33:20.7971006Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:33:20.7971180Z return func(*args, **kwds) 2025-12-04T09:33:20.7971389Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:33:20.7971589Z check_model( 2025-12-04T09:33:20.7971766Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:33:20.7971965Z assert_equal_fn( 2025-12-04T09:33:20.7972168Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:33:20.7972637Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:33:20.7972898Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:33:20.7973179Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:33:20.7973346Z AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.7973439Z 2025-12-04T09:33:20.7973488Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.7973674Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.7973903Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:33:20.7974038Z 2025-12-04T09:33:20.7974087Z The failure occurred for item [2] 2025-12-04T09:33:20.7974169Z 2025-12-04T09:33:20.7974244Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.7974569Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.7974822Z 2025-12-04T09:33:20.7974913Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.7975250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7975427Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.7975626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.7976161Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.7976588Z graph_break [] 2025-12-04T09:33:20.7976724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7976906Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.7977107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.7977605Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.7978033Z graph_break [] 2025-12-04T09:33:20.7978167Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7978345Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.7978546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.7979035Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.7979465Z graph_break [] 2025-12-04T09:33:20.7979621Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __ 2025-12-04T09:33:20.7979812Z Traceback (most recent call last): 2025-12-04T09:33:20.7980021Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:33:20.7980226Z self.common( 2025-12-04T09:33:20.7980378Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:33:20.7980553Z return func(*args, **kwds) 2025-12-04T09:33:20.7980753Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:33:20.7980959Z check_model( 2025-12-04T09:33:20.7981209Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:33:20.7981407Z assert_equal_fn( 2025-12-04T09:33:20.7981615Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:33:20.7981856Z return super().assertEqual(x, y, *args, **kwargs) 
2025-12-04T09:33:20.7982115Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:33:20.7982389Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:33:20.7982555Z AssertionError: Tensor-likes are not close! 2025-12-04T09:33:20.7982648Z 2025-12-04T09:33:20.7982695Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.7982879Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.7983122Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.7983266Z 2025-12-04T09:33:20.7983315Z The failure occurred for item [2] 2025-12-04T09:33:20.7983398Z 2025-12-04T09:33:20.7983472Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.7983794Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.7984081Z 2025-12-04T09:33:20.7984171Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.7984377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7984550Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.7984750Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.7985243Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.7985664Z graph_break [] 2025-12-04T09:33:20.7985800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7986058Z 
stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.7986257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.7986747Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.7987174Z graph_break [] 2025-12-04T09:33:20.7987308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7987484Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.7987682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.7988173Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.7988603Z graph_break [] 2025-12-04T09:33:20.7988738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7988918Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.7989114Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.7989641Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.7990066Z graph_break [] 2025-12-04T09:33:20.7990198Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:33:20.7990374Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.7990570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.7991058Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.7991481Z graph_break [] 2025-12-04T09:33:20.7991614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.7991789Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.7991989Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.7992480Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.7992942Z graph_break [] 2025-12-04T09:33:20.7993101Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __ 2025-12-04T09:33:20.7993296Z Traceback (most recent call last): 2025-12-04T09:33:20.7993510Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:33:20.7993711Z self.common( 2025-12-04T09:33:20.7993863Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:33:20.7994037Z return func(*args, **kwds) 2025-12-04T09:33:20.7994238Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu 2025-12-04T09:33:20.7994442Z check_model( 2025-12-04T09:33:20.7994617Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:33:20.7994816Z assert_equal_fn( 2025-12-04T09:33:20.7995019Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:33:20.7995258Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:33:20.7995524Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:33:20.7995791Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:33:20.7996014Z AssertionError: Tensor-likes are not close! 2025-12-04T09:33:20.7996103Z 2025-12-04T09:33:20.7996148Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.7996323Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8005998Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:33:20.8006154Z 2025-12-04T09:33:20.8006205Z The failure occurred for item [2] 2025-12-04T09:33:20.8006296Z 2025-12-04T09:33:20.8006375Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8006713Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8006971Z 2025-12-04T09:33:20.8007064Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8007279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8007458Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8007660Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8008244Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8008678Z graph_break [] 2025-12-04T09:33:20.8008822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8009004Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8009207Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8009705Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8010138Z graph_break [] 2025-12-04T09:33:20.8010280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8010465Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8010665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8011192Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8011622Z graph_break [] 2025-12-04T09:33:20.8011761Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8011941Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8012143Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8012638Z inductor [('triton_bundler_save_kernel', 112), 
('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8013076Z graph_break [] 2025-12-04T09:33:20.8013209Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8013390Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8013590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8014086Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8014526Z graph_break [] 2025-12-04T09:33:20.8014667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8014840Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8015041Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8015538Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8016017Z graph_break [] 2025-12-04T09:33:20.8016155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8016338Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8016534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8017073Z 
inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8017502Z graph_break [] 2025-12-04T09:33:20.8017640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8017819Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8018021Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8018516Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8018950Z graph_break [] 2025-12-04T09:33:20.8019090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8019271Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8019473Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8020003Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8020432Z graph_break [] 2025-12-04T09:33:20.8020560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8020739Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8020934Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:33:20.8021419Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)]
2025-12-04T09:33:20.8021845Z graph_break []
2025-12-04T09:33:20.8022004Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __
2025-12-04T09:33:20.8022195Z Traceback (most recent call last):
2025-12-04T09:33:20.8022409Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:33:20.8022616Z self.common(
2025-12-04T09:33:20.8022768Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:33:20.8022945Z return func(*args, **kwds)
2025-12-04T09:33:20.8023150Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu
2025-12-04T09:33:20.8023356Z check_model(
2025-12-04T09:33:20.8023536Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:33:20.8023733Z assert_equal_fn(
2025-12-04T09:33:20.8023940Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:33:20.8024185Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:33:20.8024449Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:33:20.8024727Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:33:20.8024894Z AssertionError: Tensor-likes are not close!
2025-12-04T09:33:20.8024984Z
2025-12-04T09:33:20.8025037Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:33:20.8025222Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:33:20.8025524Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed)
2025-12-04T09:33:20.8025666Z
2025-12-04T09:33:20.8025719Z The failure occurred for item [2]
2025-12-04T09:33:20.8025798Z
2025-12-04T09:33:20.8025878Z To execute this test, run the following from the base repo dir:
2025-12-04T09:33:20.8026269Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda
2025-12-04T09:33:20.8026515Z
2025-12-04T09:33:20.8026610Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:33:20.8026816Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:33:20.8026992Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:33:20.8027189Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:33:20.8027684Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:33:20.8028174Z graph_break []
2025-12-04T09:33:20.8028373Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:33:20.8028550Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:33:20.8028746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:33:20.8029236Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12),
('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8029664Z graph_break [] 2025-12-04T09:33:20.8029798Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8029974Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8030172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8030658Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8031083Z graph_break [] 2025-12-04T09:33:20.8031215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8031387Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8031580Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8032067Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8032489Z graph_break [] 2025-12-04T09:33:20.8032625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8032801Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8032997Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8033484Z inductor [('triton_bundler_save_kernel', 112), 
('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8033908Z graph_break [] 2025-12-04T09:33:20.8034043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8034258Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8034455Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8034940Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8035363Z graph_break [] 2025-12-04T09:33:20.8035499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8035675Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8035873Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8036408Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8036833Z graph_break [] 2025-12-04T09:33:20.8036995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8037169Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8037365Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8037848Z 
inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8038274Z graph_break [] 2025-12-04T09:33:20.8038408Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8038583Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8038778Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8039268Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8039694Z graph_break [] 2025-12-04T09:33:20.8039829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8040005Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8040201Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8040685Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8041112Z graph_break [] 2025-12-04T09:33:20.8041246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8041423Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8041616Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:33:20.8042099Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:33:20.8042517Z graph_break []
2025-12-04T09:33:20.8042674Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __
2025-12-04T09:33:20.8042899Z Traceback (most recent call last):
2025-12-04T09:33:20.8043110Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:33:20.8043313Z self.common(
2025-12-04T09:33:20.8043463Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:33:20.8043635Z return func(*args, **kwds)
2025-12-04T09:33:20.8043837Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu
2025-12-04T09:33:20.8044043Z check_model(
2025-12-04T09:33:20.8044218Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:33:20.8044418Z assert_equal_fn(
2025-12-04T09:33:20.8044622Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:33:20.8044864Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:33:20.8045126Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:33:20.8045401Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:33:20.8045590Z AssertionError: Tensor-likes are not close!
2025-12-04T09:33:20.8045680Z
2025-12-04T09:33:20.8045732Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:33:20.8045916Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:33:20.8046230Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed)
2025-12-04T09:33:20.8046372Z
2025-12-04T09:33:20.8046423Z The failure occurred for item [2]
2025-12-04T09:33:20.8046503Z
2025-12-04T09:33:20.8046583Z To execute this test, run the following from the base repo dir:
2025-12-04T09:33:20.8046905Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda
2025-12-04T09:33:20.8047156Z
2025-12-04T09:33:20.8047250Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:33:20.8047453Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:33:20.8047628Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:33:20.8047825Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:33:20.8048323Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:33:20.8048748Z graph_break []
2025-12-04T09:33:20.8048884Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:33:20.8049060Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:33:20.8049258Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:33:20.8049743Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12),
('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8050170Z graph_break [] 2025-12-04T09:33:20.8050302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8050479Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8050675Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8051207Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8051631Z graph_break [] 2025-12-04T09:33:20.8051764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8051942Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8052139Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8052628Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8053058Z graph_break [] 2025-12-04T09:33:20.8053192Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8053368Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8053565Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8054052Z inductor [('triton_bundler_save_kernel', 112), 
('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8054510Z graph_break [] 2025-12-04T09:33:20.8054645Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8054818Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8055014Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8055498Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8055956Z graph_break [] 2025-12-04T09:33:20.8056091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8056265Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8056461Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8056950Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8057374Z graph_break [] 2025-12-04T09:33:20.8057506Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8057681Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8057879Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8058363Z 
inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8058786Z graph_break [] 2025-12-04T09:33:20.8058917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8059090Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8059286Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8059803Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8060227Z graph_break [] 2025-12-04T09:33:20.8060360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8060534Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8060731Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8061223Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8061651Z graph_break [] 2025-12-04T09:33:20.8061783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8061955Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8062154Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8062638Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8063092Z graph_break [] 2025-12-04T09:33:20.8063225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8063401Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8063596Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8064078Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8064494Z graph_break [] 2025-12-04T09:33:20.8064618Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8064787Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8064973Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8065454Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8065871Z graph_break [] 2025-12-04T09:33:20.8066067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8066233Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8066422Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8066903Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8067327Z graph_break [] 2025-12-04T09:33:20.8067453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8067621Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8067809Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8068331Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8068747Z graph_break [] 2025-12-04T09:33:20.8068871Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8069037Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8069228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8069703Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8070121Z graph_break [] 2025-12-04T09:33:20.8070244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8070409Z stats [('calls_captured', 12), ('unique_graphs', 
2)] 2025-12-04T09:33:20.8070598Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8071073Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8071517Z graph_break [] 2025-12-04T09:33:20.8071640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8071807Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8071995Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8072475Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8072892Z graph_break [] 2025-12-04T09:33:20.8073014Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8073185Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8073374Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8073854Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8074269Z graph_break [] 2025-12-04T09:33:20.8074393Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8074559Z 
stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8074746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8075221Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8075640Z graph_break [] 2025-12-04T09:33:20.8075766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8075980Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8076168Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8076684Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8077100Z graph_break [] 2025-12-04T09:33:20.8077225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8077396Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8077582Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8078058Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8078479Z graph_break [] 2025-12-04T09:33:20.8078605Z ----------------------------- Captured stdout call 
-----------------------------
2025-12-04T09:33:20.8078768Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:33:20.8078958Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:33:20.8079432Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:33:20.8079873Z graph_break []
2025-12-04T09:33:20.8080023Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __
2025-12-04T09:33:20.8080203Z Traceback (most recent call last):
2025-12-04T09:33:20.8080406Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean
2025-12-04T09:33:20.8080603Z self.common(
2025-12-04T09:33:20.8080744Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner
2025-12-04T09:33:20.8080908Z return func(*args, **kwds)
2025-12-04T09:33:20.8081105Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 725, in check_model_gpu
2025-12-04T09:33:20.8081303Z check_model(
2025-12-04T09:33:20.8081471Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model
2025-12-04T09:33:20.8081662Z assert_equal_fn(
2025-12-04T09:33:20.8081859Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T09:33:20.8082092Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T09:33:20.8082342Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T09:33:20.8082612Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T09:33:20.8082771Z AssertionError: Tensor-likes are not close!
2025-12-04T09:33:20.8082859Z
2025-12-04T09:33:20.8082905Z Mismatched elements: 4 / 4 (100.0%)
2025-12-04T09:33:20.8083076Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed)
2025-12-04T09:33:20.8083297Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed)
2025-12-04T09:33:20.8083422Z
2025-12-04T09:33:20.8083469Z The failure occurred for item [2]
2025-12-04T09:33:20.8083546Z
2025-12-04T09:33:20.8083617Z To execute this test, run the following from the base repo dir:
2025-12-04T09:33:20.8083931Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda
2025-12-04T09:33:20.8084175Z
2025-12-04T09:33:20.8084262Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:33:20.8084457Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:33:20.8084619Z stats [('calls_captured', 6), ('unique_graphs', 1)]
2025-12-04T09:33:20.8084809Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:33:20.8085317Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:33:20.8085729Z graph_break []
2025-12-04T09:33:20.8085854Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:33:20.8086077Z stats [('calls_captured', 12), ('unique_graphs', 2)]
2025-12-04T09:33:20.8086266Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:33:20.8086746Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12),
('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8087165Z graph_break [] 2025-12-04T09:33:20.8087291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8087456Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8087643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8088162Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8088575Z graph_break [] 2025-12-04T09:33:20.8088699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8088862Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8089049Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8089526Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8089951Z graph_break [] 2025-12-04T09:33:20.8090075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8090240Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8090427Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8090907Z inductor [('triton_bundler_save_kernel', 112), 
('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8091321Z graph_break [] 2025-12-04T09:33:20.8091447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8091612Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8091800Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8092277Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8092688Z graph_break [] 2025-12-04T09:33:20.8092814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8092981Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8093167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8093680Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8094098Z graph_break [] 2025-12-04T09:33:20.8094224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8094398Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8094593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8095078Z 
inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8095505Z graph_break [] 2025-12-04T09:33:20.8095639Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8095813Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8096058Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8096579Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8097008Z graph_break [] 2025-12-04T09:33:20.8097141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8097313Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8097509Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8097994Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8098419Z graph_break [] 2025-12-04T09:33:20.8098550Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8098723Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8098919Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8099401Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8099820Z graph_break [] 2025-12-04T09:33:20.8099954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8100129Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8100324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8100814Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8101244Z graph_break [] 2025-12-04T09:33:20.8101379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8101552Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8101748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8102268Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8102695Z graph_break [] 2025-12-04T09:33:20.8102828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8103002Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8103197Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8103680Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8104103Z graph_break [] 2025-12-04T09:33:20.8104237Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8104411Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8104606Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8105113Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8105542Z graph_break [] 2025-12-04T09:33:20.8105671Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8105845Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8106100Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8106585Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8107019Z graph_break [] 2025-12-04T09:33:20.8107148Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8107322Z stats [('calls_captured', 12), ('unique_graphs', 
2)] 2025-12-04T09:33:20.8107517Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8108009Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8108434Z graph_break [] 2025-12-04T09:33:20.8108568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8108743Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8108933Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8109428Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8109852Z graph_break [] 2025-12-04T09:33:20.8109985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8110159Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8110356Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8110877Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8111303Z graph_break [] 2025-12-04T09:33:20.8111436Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8111611Z 
stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8111808Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8112296Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8112725Z graph_break [] 2025-12-04T09:33:20.8112861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8113037Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8113231Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8113755Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8114181Z graph_break [] 2025-12-04T09:33:20.8114316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8114493Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8114692Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8115181Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8115610Z graph_break [] 2025-12-04T09:33:20.8115741Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:33:20.8115914Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8116161Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8116644Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8117067Z graph_break [] 2025-12-04T09:33:20.8117201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8117376Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8117571Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8118058Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8118486Z graph_break [] 2025-12-04T09:33:20.8118646Z _ DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda __ 2025-12-04T09:33:20.8118838Z Traceback (most recent call last): 2025-12-04T09:33:20.8119050Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 5770, in test_var_mean 2025-12-04T09:33:20.8119254Z self.common( 2025-12-04T09:33:20.8119448Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 79, in inner 2025-12-04T09:33:20.8119624Z return func(*args, **kwds) 2025-12-04T09:33:20.8119827Z File "/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 692, in check_model_gpu 2025-12-04T09:33:20.8120035Z check_model( 2025-12-04T09:33:20.8120213Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py", line 566, in check_model 2025-12-04T09:33:20.8120409Z assert_equal_fn( 2025-12-04T09:33:20.8120604Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T09:33:20.8120841Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T09:33:20.8121100Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T09:33:20.8121373Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T09:33:20.8121537Z AssertionError: Tensor-likes are not close! 2025-12-04T09:33:20.8121631Z 2025-12-04T09:33:20.8121678Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8121864Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8122143Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.8122287Z 2025-12-04T09:33:20.8122335Z The failure occurred for item [2] 2025-12-04T09:33:20.8122419Z 2025-12-04T09:33:20.8122493Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8122814Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8123064Z 2025-12-04T09:33:20.8123154Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8123357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8123530Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8123726Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8124215Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8124644Z graph_break [] 2025-12-04T09:33:20.8124773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8124951Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8125148Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8125639Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8126121Z graph_break [] 2025-12-04T09:33:20.8126254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8126426Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8126620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8127109Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8127534Z graph_break [] 2025-12-04T09:33:20.8127665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8127884Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8128074Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8128563Z inductor [('triton_bundler_save_kernel', 112), 
('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8128988Z graph_break [] 2025-12-04T09:33:20.8129123Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8129298Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8129494Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8129979Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8130405Z graph_break [] 2025-12-04T09:33:20.8130539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8130740Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8130942Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8131425Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8131847Z graph_break [] 2025-12-04T09:33:20.8131981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8132156Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8132352Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8132836Z 
inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8133258Z graph_break [] 2025-12-04T09:33:20.8133391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8133565Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8133759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8134250Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8134680Z graph_break [] 2025-12-04T09:33:20.8134813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8134992Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8135187Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8135672Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8136127Z graph_break [] 2025-12-04T09:33:20.8136261Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8136466Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8136662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8137145Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8137570Z graph_break [] 2025-12-04T09:33:20.8137701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8137874Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8138067Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8138555Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8138977Z graph_break [] 2025-12-04T09:33:20.8139109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8139310Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8139504Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8139986Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8140410Z graph_break [] 2025-12-04T09:33:20.8140542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8140718Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8140916Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8141404Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8141833Z graph_break [] 2025-12-04T09:33:20.8141966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8142141Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8142339Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8142823Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8143253Z graph_break [] 2025-12-04T09:33:20.8143386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8143564Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8143761Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8144249Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8144677Z graph_break [] 2025-12-04T09:33:20.8144809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8145010Z stats [('calls_captured', 12), ('unique_graphs', 
2)] 2025-12-04T09:33:20.8145206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8145687Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8146174Z graph_break [] 2025-12-04T09:33:20.8146306Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8146480Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8146674Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8147159Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8147582Z graph_break [] 2025-12-04T09:33:20.8147715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8147926Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8148123Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8148608Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8149037Z graph_break [] 2025-12-04T09:33:20.8149169Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8149344Z 
stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8149539Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8150027Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8150456Z graph_break [] 2025-12-04T09:33:20.8150589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8150764Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8150959Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8151443Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8151866Z graph_break [] 2025-12-04T09:33:20.8151995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8152169Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8152362Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8152845Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8153267Z graph_break [] 2025-12-04T09:33:20.8153399Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:33:20.8153572Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8153806Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8154290Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8154714Z graph_break [] 2025-12-04T09:33:20.8154846Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8155018Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8155213Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8155695Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8156179Z graph_break [] 2025-12-04T09:33:20.8156311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8156522Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8156717Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8157201Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8157626Z graph_break [] 2025-12-04T09:33:20.8157759Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8157937Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8158138Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8158624Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8159052Z graph_break [] 2025-12-04T09:33:20.8159185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8159366Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8159567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8160055Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8160481Z graph_break [] 2025-12-04T09:33:20.8160618Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8160799Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8160998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8161488Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 
2025-12-04T09:33:20.8161912Z graph_break [] 2025-12-04T09:33:20.8162045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8162222Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8162453Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8162935Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8163355Z graph_break [] 2025-12-04T09:33:20.8163484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8163653Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8163843Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8164321Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8164741Z graph_break [] 2025-12-04T09:33:20.8164869Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8165067Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8165260Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8165739Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 
2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8166207Z graph_break [] 2025-12-04T09:33:20.8166334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8166504Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8166696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8167176Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8167602Z graph_break [] 2025-12-04T09:33:20.8167733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8167904Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8168094Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8168576Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8168993Z graph_break [] 2025-12-04T09:33:20.8169120Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8169294Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8169484Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8169963Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), 
('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8170387Z graph_break [] 2025-12-04T09:33:20.8170511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8170681Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8170907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8171386Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8171808Z graph_break [] 2025-12-04T09:33:20.8171934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8172102Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8172290Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8172775Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), ('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8173194Z graph_break [] 2025-12-04T09:33:20.8173322Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8173520Z stats [('calls_captured', 12), ('unique_graphs', 2)] 2025-12-04T09:33:20.8173707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:33:20.8174188Z inductor [('triton_bundler_save_kernel', 112), ('benchmarking.InductorBenchmarker.benchmark', 12), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 12), ('async_compile_cache_miss', 8), ('async_compile_cache_hit', 4), ('fxgraph_cache_miss', 2), ('triton_bundler_save_static_autotuner', 2)] 2025-12-04T09:33:20.8174612Z graph_break [] 2025-12-04T09:33:20.8174741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:33:20.8174910Z stats [('calls_captured', 6), ('unique_graphs', 1)] 2025-12-04T09:33:20.8175103Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:33:20.8175581Z inductor [('triton_bundler_save_kernel', 56), ('benchmarking.InductorBenchmarker.benchmark', 6), ('benchmarking.InductorBenchmarker.benchmark_gpu', 6), ('async_compile_cache_miss', 4), ('async_compile_cache_hit', 2), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:33:20.8176032Z graph_break [] 2025-12-04T09:33:20.8176352Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-da9ca06781626cfc.xml - 2025-12-04T09:33:20.8176712Z =========================== short test summary info ============================ 2025-12-04T09:33:20.8177080Z FAILED [1.0620s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8177352Z 2025-12-04T09:33:20.8177401Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8177582Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8177824Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.8177962Z 2025-12-04T09:33:20.8178012Z The failure occurred for item [2] 2025-12-04T09:33:20.8178089Z 2025-12-04T09:33:20.8178168Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8178488Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8178733Z 2025-12-04T09:33:20.8178826Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8179218Z FAILED [1.3389s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8179488Z 2025-12-04T09:33:20.8179538Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8179712Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8179936Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:33:20.8180067Z 2025-12-04T09:33:20.8180115Z The failure occurred for item [2] 2025-12-04T09:33:20.8180191Z 2025-12-04T09:33:20.8180267Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8180582Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8180827Z 2025-12-04T09:33:20.8180916Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8181278Z FAILED [0.6341s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8181550Z 2025-12-04T09:33:20.8181595Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8181809Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8182045Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.8182182Z 2025-12-04T09:33:20.8182228Z The failure occurred for item [2] 2025-12-04T09:33:20.8182303Z 2025-12-04T09:33:20.8182378Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8182698Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8182943Z 2025-12-04T09:33:20.8183029Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8183388Z FAILED [1.6697s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8183660Z 2025-12-04T09:33:20.8183707Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8183878Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8184100Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:33:20.8184233Z 2025-12-04T09:33:20.8184279Z The failure occurred for item [2] 2025-12-04T09:33:20.8184358Z 2025-12-04T09:33:20.8184431Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8184743Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8184991Z 2025-12-04T09:33:20.8185079Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8185438Z FAILED [0.6472s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8185709Z 2025-12-04T09:33:20.8185754Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8186003Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8186238Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.8186376Z 2025-12-04T09:33:20.8186421Z The failure occurred for item [2] 2025-12-04T09:33:20.8186499Z 2025-12-04T09:33:20.8186571Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8186884Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8187128Z 2025-12-04T09:33:20.8187246Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8187608Z FAILED [0.6402s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8187881Z 2025-12-04T09:33:20.8187926Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8188100Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8188336Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.8188475Z 2025-12-04T09:33:20.8188519Z The failure occurred for item [2] 2025-12-04T09:33:20.8188597Z 2025-12-04T09:33:20.8188669Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8188984Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8189229Z 2025-12-04T09:33:20.8189314Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8189669Z FAILED [1.2823s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8189967Z 2025-12-04T09:33:20.8190013Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8190179Z Greatest absolute difference: 0.58544921875 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8190402Z Greatest relative difference: 0.57568359375 at index (0, 1) (up to 0.001 allowed) 2025-12-04T09:33:20.8190532Z 2025-12-04T09:33:20.8190577Z The failure occurred for item [2] 2025-12-04T09:33:20.8190654Z 2025-12-04T09:33:20.8190726Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8191039Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8191284Z 2025-12-04T09:33:20.8191371Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8191732Z FAILED [0.8177s] inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_var_mean_tile_reduction_True_dynamic_shapes_cuda - AssertionError: Tensor-likes are not close! 
2025-12-04T09:33:20.8192005Z 2025-12-04T09:33:20.8192052Z Mismatched elements: 4 / 4 (100.0%) 2025-12-04T09:33:20.8192226Z Greatest absolute difference: 0.5851404666900635 at index (0, 3) (up to 1e-05 allowed) 2025-12-04T09:33:20.8196538Z Greatest relative difference: 0.5756681561470032 at index (0, 1) (up to 1.3e-06 allowed) 2025-12-04T09:33:20.8196682Z 2025-12-04T09:33:20.8196729Z The failure occurred for item [2] 2025-12-04T09:33:20.8196809Z 2025-12-04T09:33:20.8196883Z To execute this test, run the following from the base repo dir: 2025-12-04T09:33:20.8197211Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesGPUTests.test_var_mean_tile_reduction_True_dynamic_shapes_cuda 2025-12-04T09:33:20.8197461Z 2025-12-04T09:33:20.8197548Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:33:20.8197744Z =================== 8 failed, 92 passed in 105.64s (0:01:45) =================== 2025-12-04T09:33:20.8197851Z 2025-12-04T09:33:20.8198067Z FINISHED PRINTING LOG FILE of inductor/test_torchinductor_dynamic_shapes 3/4 (test/test-reports/inductor.test_torchinductor_dynamic_shapes_3.4_a03110cdf8deed71_.log) 2025-12-04T09:33:20.8198317Z 2025-12-04T09:33:20.8198458Z Finished inductor/test_torchinductor_dynamic_shapes 3/4 ... [2025-12-04 09:33:20.786140][5634821.292535432], took 1.90min 2025-12-04T09:33:20.8198873Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:33:20.8791104Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:33:20.8810368Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T09:33:20.8810566Z Uploading artifacts took 0.00 seconds 2025-12-04T09:33:20.8810718Z inductor/test_torchinductor_dynamic_shapes 3/4 failed! 
2025-12-04T09:33:20.8813550Z Running inductor/test_kernel_benchmark 1/1 ... [2025-12-04 09:33:20.881257][5634821.38765462] 2025-12-04T09:33:20.8813758Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:33:20.8817084Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_kernel_benchmark.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:33:20.881588] 2025-12-04T09:42:50.1839000Z 2025-12-04T09:42:50.1840245Z PRINTING LOG FILE of inductor/test_kernel_benchmark 1/1 (test/test-reports/inductor.test_kernel_benchmark_1.1_80b5ed88f0cc4a76_.log) 2025-12-04T09:42:50.1841710Z Test results will be stored in test-reports/python-pytest/inductor.test_kernel_benchmark/inductor.test_kernel_benchmark-05a8c0c9d49884d6.xml 2025-12-04T09:42:50.1842662Z ============================= test session starts ============================== 2025-12-04T09:42:50.1844134Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:42:50.1844774Z cachedir: .pytest_cache 2025-12-04T09:42:50.1845540Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:42:50.1846482Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T09:42:50.1846894Z configfile: pytest.ini 2025-12-04T09:42:50.1847653Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:42:50.1848461Z collecting ... 
collected 18 items 2025-12-04T09:42:50.1848948Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:42:50.1890339Z Running 100 items in this shard: test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, 
test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, 
test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, 
test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, 
test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark, test/inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark 2025-12-04T09:42:50.1931093Z 2025-12-04T09:42:50.1931581Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [8.5619s] [ 1%] 2025-12-04T09:42:50.1932596Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.1842s] [ 2%] 2025-12-04T09:42:50.1933590Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.9195s] [ 2%] 2025-12-04T09:42:50.1934596Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.2833s] [ 2%] 
2025-12-04T09:42:50.1935590Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.6054s] [ 2%] 2025-12-04T09:42:50.1936661Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.2612s] [ 2%] 2025-12-04T09:42:50.1937652Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.4016s] [ 2%] 2025-12-04T09:42:50.1938749Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3537s] [ 2%] 2025-12-04T09:42:50.1939743Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3196s] [ 2%] 2025-12-04T09:42:50.1940749Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3348s] [ 2%] 2025-12-04T09:42:50.1941744Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3309s] [ 2%] 2025-12-04T09:42:50.1942745Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.2929s] [ 2%] 2025-12-04T09:42:50.1943738Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.7802s] [ 2%] 2025-12-04T09:42:50.1944735Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.2509s] [ 2%] 2025-12-04T09:42:50.1945735Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3989s] [ 2%] 2025-12-04T09:42:50.1946783Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1102s] [ 2%] 2025-12-04T09:42:50.1947777Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1197s] [ 2%] 2025-12-04T09:42:50.1948854Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3206s] [ 2%] 2025-12-04T09:42:50.1949849Z 
inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [7.0075s] [ 2%] 2025-12-04T09:42:50.1950852Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.9755s] [ 2%] 2025-12-04T09:42:50.1951854Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3557s] [ 2%] 2025-12-04T09:42:50.1952849Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.6676s] [ 2%] 2025-12-04T09:42:50.1953850Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3867s] [ 2%] 2025-12-04T09:42:50.1954842Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.4149s] [ 2%] 2025-12-04T09:42:50.1955844Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.2842s] [ 2%] 2025-12-04T09:42:50.1956879Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.4148s] [ 2%] 2025-12-04T09:42:50.1957871Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3854s] [ 2%] 2025-12-04T09:42:50.1958862Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3156s] [ 2%] 2025-12-04T09:42:50.1959857Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.5150s] [ 2%] 2025-12-04T09:42:50.1960856Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1123s] [ 2%] 2025-12-04T09:42:50.1961857Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1010s] [ 2%] 2025-12-04T09:42:50.1962857Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.9982s] [ 2%] 2025-12-04T09:42:50.1963862Z 
inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.7655s] [ 2%] 2025-12-04T09:42:50.1964853Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1213s] [ 2%] 2025-12-04T09:42:50.1965851Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1376s] [ 2%] 2025-12-04T09:42:50.1966888Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.9382s] [ 2%] 2025-12-04T09:42:50.1967975Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.7718s] [ 2%] 2025-12-04T09:42:50.1968974Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.0401s] [ 2%] 2025-12-04T09:42:50.1969972Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.9718s] [ 2%] 2025-12-04T09:42:50.1970967Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3496s] [ 2%] 2025-12-04T09:42:50.1971960Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.8630s] [ 2%] 2025-12-04T09:42:50.1972953Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.0051s] [ 2%] 2025-12-04T09:42:50.1973956Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.9086s] [ 2%] 2025-12-04T09:42:50.1974959Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.0303s] [ 2%] 2025-12-04T09:42:50.1976019Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.8077s] [ 2%] 2025-12-04T09:42:50.1977023Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.8919s] [ 2%] 2025-12-04T09:42:50.1978100Z 
inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.4015s] [ 2%] 2025-12-04T09:42:50.1979092Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.1720s] [ 2%] 2025-12-04T09:42:50.1980090Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.3834s] [ 2%] 2025-12-04T09:42:50.1981079Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [6.0117s] [ 2%] 2025-12-04T09:42:50.1982072Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_mm_triton_kernel_benchmark PASSED [5.9303s] [ 2%] 2025-12-04T09:42:50.1983058Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.8054s] [ 2%] 2025-12-04T09:42:50.1984012Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.0875s] [ 2%] 2025-12-04T09:42:50.1984966Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.2671s] [ 2%] 2025-12-04T09:42:50.1985909Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.1959s] [ 2%] 2025-12-04T09:42:50.1986947Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.8273s] [ 2%] 2025-12-04T09:42:50.1987891Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.8356s] [ 2%] 2025-12-04T09:42:50.1988834Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.9921s] [ 2%] 2025-12-04T09:42:50.1989787Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9103s] [ 2%] 2025-12-04T09:42:50.1990729Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.9567s] [ 2%] 2025-12-04T09:42:50.1991671Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.7513s] [ 
2%] 2025-12-04T09:42:50.1992618Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.7447s] [ 2%] 2025-12-04T09:42:50.1993561Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.7427s] [ 2%] 2025-12-04T09:42:50.1994501Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.8499s] [ 2%] 2025-12-04T09:42:50.1995447Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.6457s] [ 2%] 2025-12-04T09:42:50.1996453Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.7916s] [ 2%] 2025-12-04T09:42:50.1997488Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.7264s] [ 2%] 2025-12-04T09:42:50.1998434Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.8543s] [ 2%] 2025-12-04T09:42:50.1999383Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9731s] [ 2%] 2025-12-04T09:42:50.2000327Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.3472s] [ 2%] 2025-12-04T09:42:50.2001272Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9436s] [ 2%] 2025-12-04T09:42:50.2002219Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.2033s] [ 2%] 2025-12-04T09:42:50.2003164Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9421s] [ 2%] 2025-12-04T09:42:50.2004111Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.0374s] [ 2%] 2025-12-04T09:42:50.2005060Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9823s] [ 2%] 2025-12-04T09:42:50.2006060Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.1733s] [ 2%] 
2025-12-04T09:42:50.2007092Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9936s] [ 2%] 2025-12-04T09:42:50.2008035Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.1013s] [ 2%] 2025-12-04T09:42:50.2008979Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.9272s] [ 2%] 2025-12-04T09:42:50.2009920Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.1649s] [ 2%] 2025-12-04T09:42:50.2010863Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.2223s] [ 2%] 2025-12-04T09:42:50.2011812Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.1034s] [ 2%] 2025-12-04T09:42:50.2012755Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.0171s] [ 2%] 2025-12-04T09:42:50.2013695Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.7890s] [ 2%] 2025-12-04T09:42:50.2014641Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.8791s] [ 2%] 2025-12-04T09:42:50.2015587Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.3050s] [ 2%] 2025-12-04T09:42:50.2017755Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.0239s] [ 2%] 2025-12-04T09:42:50.2018701Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.9151s] [ 2%] 2025-12-04T09:42:50.2019642Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.0326s] [ 2%] 2025-12-04T09:42:50.2020593Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [4.9146s] [ 2%] 2025-12-04T09:42:50.2021535Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.0198s] [ 2%] 
2025-12-04T09:42:50.2022484Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark FAILED [5.1448s] [ 2%] 2025-12-04T09:42:50.2023431Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.1893s] [ 2%] 2025-12-04T09:42:50.2024382Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [5.1722s] [ 2%] 2025-12-04T09:42:50.2025330Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.6280s] [ 2%] 2025-12-04T09:42:50.2026361Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.5194s] [ 2%] 2025-12-04T09:42:50.2027416Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.5521s] [ 2%] 2025-12-04T09:42:50.2028366Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.6098s] [ 2%] 2025-12-04T09:42:50.2029316Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.6323s] [ 2%] 2025-12-04T09:42:50.2030271Z inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark PASSED [4.6004s] [ 2%] 2025-12-04T09:42:50.2030817Z 2025-12-04T09:42:50.2031011Z =================================== FAILURES =================================== 2025-12-04T09:42:50.2031611Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2032173Z Traceback (most recent call last): 2025-12-04T09:42:50.2032922Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2033667Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2034390Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2035098Z ).run(bench_out) 2025-12-04T09:42:50.2035526Z RuntimeError: Expected to not find "GB/s" but found it 
2025-12-04T09:42:50.2036053Z None UNK cufi44wmvf 2025-12-04T09:42:50.2036855Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2037899Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2038661Z ~~~~ <--- HERE 2025-12-04T09:42:50.2039403Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2040125Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2040334Z 2025-12-04T09:42:50.2040340Z 2025-12-04T09:42:50.2040595Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2041462Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2042084Z 2025-12-04T09:42:50.2042381Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2043051Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2043570Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2043991Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2044640Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2046321Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2047727Z graph_break [] 2025-12-04T09:42:50.2048159Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2049009Z Compiled module path: 
/tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2049884Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2050444Z Traceback (most recent call last): 2025-12-04T09:42:50.2051170Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2051911Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2052619Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2053315Z ).run(bench_out) 2025-12-04T09:42:50.2053729Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2054196Z None UNK cufi44wmvf 2025-12-04T09:42:50.2054967Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2056045Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2056805Z ~~~~ <--- HERE 2025-12-04T09:42:50.2057534Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2058255Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2058461Z 2025-12-04T09:42:50.2058466Z 2025-12-04T09:42:50.2058712Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2059566Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2060182Z 2025-12-04T09:42:50.2060480Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2061130Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2061721Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2062140Z stats 
[('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2062775Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2064378Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2065761Z graph_break [] 2025-12-04T09:42:50.2066228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2067070Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2067885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2068383Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2068807Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2069427Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2071018Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2072413Z graph_break [] 2025-12-04T09:42:50.2072828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2073650Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2074510Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2075064Z Traceback (most recent call last): 2025-12-04T09:42:50.2075783Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2076571Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2077279Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2077979Z ).run(bench_out) 2025-12-04T09:42:50.2078386Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2078854Z None UNK cufi44wmvf 2025-12-04T09:42:50.2079532Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2080630Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2081373Z ~~~~ <--- HERE 2025-12-04T09:42:50.2082101Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2082831Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2083040Z 2025-12-04T09:42:50.2083047Z 2025-12-04T09:42:50.2083289Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2084136Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2084750Z 2025-12-04T09:42:50.2085038Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2085683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2086256Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2086681Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2087312Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2088915Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 
3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2090380Z graph_break [] 2025-12-04T09:42:50.2090795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2091622Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2092438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2092943Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2093363Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2093987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2095586Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2097039Z graph_break [] 2025-12-04T09:42:50.2097451Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2098271Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2099077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2099575Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2099990Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2100622Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2102215Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2103603Z graph_break [] 2025-12-04T09:42:50.2104013Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2104826Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2105677Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2106302Z Traceback (most recent call last): 2025-12-04T09:42:50.2107028Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2107858Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2108571Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2109268Z ).run(bench_out) 2025-12-04T09:42:50.2109684Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2110152Z None UNK cufi44wmvf 2025-12-04T09:42:50.2110829Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2111850Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2112595Z ~~~~ <--- HERE 2025-12-04T09:42:50.2113330Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2114052Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2114254Z 2025-12-04T09:42:50.2114260Z 2025-12-04T09:42:50.2114508Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2115437Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2116099Z 2025-12-04T09:42:50.2116393Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2117036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2117538Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2117950Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2118582Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2120193Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2121578Z graph_break [] 2025-12-04T09:42:50.2121996Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2122835Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2123646Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2124147Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2124558Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2125184Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2126830Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2128228Z graph_break [] 2025-12-04T09:42:50.2128641Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:42:50.2129467Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2130275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2130774Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2131185Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2131805Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2133500Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2134886Z graph_break [] 2025-12-04T09:42:50.2135299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2136170Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2136970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2137470Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2137879Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2138503Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2140116Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2141496Z graph_break [] 2025-12-04T09:42:50.2141908Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:42:50.2142720Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2143649Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2144208Z Traceback (most recent call last): 2025-12-04T09:42:50.2144925Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2145661Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2146422Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2147119Z ).run(bench_out) 2025-12-04T09:42:50.2147527Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2147995Z None UNK cufi44wmvf 2025-12-04T09:42:50.2148683Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2149718Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2150467Z ~~~~ <--- HERE 2025-12-04T09:42:50.2151196Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2151920Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2152120Z 2025-12-04T09:42:50.2152126Z 2025-12-04T09:42:50.2152374Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2153215Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2153815Z 2025-12-04T09:42:50.2154112Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2154754Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:42:50.2155266Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2155679Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2156347Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2157938Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2159322Z graph_break [] 2025-12-04T09:42:50.2159736Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2160640Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2161447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2161950Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2162366Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2162987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2164575Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2166002Z graph_break [] 2025-12-04T09:42:50.2166414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2167234Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2168052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2168551Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2168961Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2169667Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2171249Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2172632Z graph_break [] 2025-12-04T09:42:50.2173042Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2173855Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2174663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2175163Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2175565Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2176256Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2177885Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2179291Z graph_break [] 2025-12-04T09:42:50.2179699Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2180506Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2181320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2181820Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:42:50.2182226Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2182851Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2184445Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2185848Z graph_break [] 2025-12-04T09:42:50.2186319Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2187149Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2187961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2188677Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2189091Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2189710Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2191312Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2192701Z graph_break [] 2025-12-04T09:42:50.2193105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2193926Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2194783Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2195339Z Traceback (most recent call last): 2025-12-04T09:42:50.2196113Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2196859Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2197658Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2198366Z ).run(bench_out) 2025-12-04T09:42:50.2198772Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2199241Z None UNK cufi44wmvf 2025-12-04T09:42:50.2199918Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2200934Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2201675Z ~~~~ <--- HERE 2025-12-04T09:42:50.2202404Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2203132Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2203344Z 2025-12-04T09:42:50.2203349Z 2025-12-04T09:42:50.2203589Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2204430Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2205039Z 2025-12-04T09:42:50.2205324Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2206007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2206513Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2206940Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2207580Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2209211Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 
3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2210605Z graph_break [] 2025-12-04T09:42:50.2211018Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2211843Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2212657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2213158Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2213567Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2214189Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2215871Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2217319Z graph_break [] 2025-12-04T09:42:50.2217731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2218555Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2219378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2219879Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2220288Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2220907Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2222501Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2223984Z graph_break [] 2025-12-04T09:42:50.2224394Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2225204Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2226079Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2226578Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2226987Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2227608Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2229207Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2230596Z graph_break [] 2025-12-04T09:42:50.2231020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2231835Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2232426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2232743Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2233004Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2233395Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2234398Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2235272Z graph_break [] 2025-12-04T09:42:50.2235533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2236099Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2236611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2236927Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2237189Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2237581Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2238633Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2239514Z graph_break [] 2025-12-04T09:42:50.2239776Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2240297Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2240809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2241125Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2241385Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2241778Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2242776Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2243722Z graph_break [] 2025-12-04T09:42:50.2243982Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2244488Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2245075Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2245428Z Traceback (most recent call last): 2025-12-04T09:42:50.2245886Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2246397Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2246851Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2247299Z ).run(bench_out) 2025-12-04T09:42:50.2247559Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2247857Z None UNK cufi44wmvf 2025-12-04T09:42:50.2248297Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2248947Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2249417Z ~~~~ <--- HERE 2025-12-04T09:42:50.2249870Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2250321Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2250450Z 2025-12-04T09:42:50.2250453Z 2025-12-04T09:42:50.2250604Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2251136Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2251515Z 2025-12-04T09:42:50.2251699Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2252103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2252425Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2252681Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2253074Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2254079Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2254948Z graph_break [] 2025-12-04T09:42:50.2255206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2273255Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2273882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2274211Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2274491Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2274898Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2275911Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2276827Z graph_break [] 2025-12-04T09:42:50.2277098Z ----------------------------- 
Captured stderr call ----------------------------- 2025-12-04T09:42:50.2277632Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2278147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2278465Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2278733Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2279203Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2280206Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2281080Z graph_break [] 2025-12-04T09:42:50.2281347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2281867Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2282386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2282710Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2282978Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2283384Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2284394Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2285269Z graph_break [] 2025-12-04T09:42:50.2285534Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:42:50.2286106Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2286632Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2286951Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2287219Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2287620Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2288641Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2289517Z graph_break [] 2025-12-04T09:42:50.2289786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2290318Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2290842Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2291217Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2291489Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2291891Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2292898Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2293774Z graph_break [] 2025-12-04T09:42:50.2294039Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2294554Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2295071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2295397Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2295670Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2296110Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2297169Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2298052Z graph_break [] 2025-12-04T09:42:50.2298322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2298837Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2299353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2299677Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2299952Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2300353Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2301353Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2302236Z graph_break [] 2025-12-04T09:42:50.2302506Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2303023Z Compiled module 
path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2303535Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2303857Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2304126Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2304530Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2305534Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2306455Z graph_break [] 2025-12-04T09:42:50.2306719Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2307230Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2307763Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2308117Z Traceback (most recent call last): 2025-12-04T09:42:50.2308582Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2309124Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2309577Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2310025Z ).run(bench_out) 2025-12-04T09:42:50.2310294Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2310597Z None UNK cufi44wmvf 2025-12-04T09:42:50.2311038Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2311691Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, 
waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2312164Z ~~~~ <--- HERE 2025-12-04T09:42:50.2312629Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2313084Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2313218Z 2025-12-04T09:42:50.2313229Z 2025-12-04T09:42:50.2313386Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2313982Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2314365Z 2025-12-04T09:42:50.2314547Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2314956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2315277Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2315540Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2315987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2317009Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2317881Z graph_break [] 2025-12-04T09:42:50.2318152Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2318672Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2319187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2319504Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2319768Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2320162Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2321164Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2322033Z graph_break [] 2025-12-04T09:42:50.2322294Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2322812Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2323323Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2323638Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2323901Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2324298Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2325348Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2326272Z graph_break [] 2025-12-04T09:42:50.2326534Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2327054Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2327563Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2327879Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2328140Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2328537Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2329544Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2330424Z graph_break [] 2025-12-04T09:42:50.2330684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2331253Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2331763Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2332079Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2332339Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2332736Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2333736Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2334615Z graph_break [] 2025-12-04T09:42:50.2334876Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2335391Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2335911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2336268Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2336528Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2336919Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2337921Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2338785Z graph_break [] 2025-12-04T09:42:50.2339049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2339567Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2340085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2340402Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2340665Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2341056Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2342055Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2342923Z graph_break [] 2025-12-04T09:42:50.2343243Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2343752Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2344255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2344570Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2344831Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2345227Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2346285Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2347148Z graph_break [] 2025-12-04T09:42:50.2347409Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2347929Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2348436Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2348807Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2349069Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2349460Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2350456Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2351326Z graph_break [] 2025-12-04T09:42:50.2351585Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2352100Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2352606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2352928Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2353188Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2353581Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2354582Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2355452Z graph_break [] 2025-12-04T09:42:50.2355712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2356276Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2356780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2357095Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2357361Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2357754Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2358750Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2359616Z graph_break [] 2025-12-04T09:42:50.2359877Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2360394Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2360957Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2361276Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2361530Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2361924Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2362914Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2363773Z graph_break [] 2025-12-04T09:42:50.2364028Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2364540Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2365051Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2365364Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2365623Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2366112Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2367105Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2367971Z graph_break [] 2025-12-04T09:42:50.2368227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2368730Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2369230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2369544Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2369801Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2370192Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2371187Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2372052Z graph_break [] 2025-12-04T09:42:50.2372309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2372825Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2373333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2373649Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2373906Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2374296Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2375302Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2376210Z graph_break [] 2025-12-04T09:42:50.2376467Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2376980Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2377481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2377791Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2378122Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2378512Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2379502Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2380370Z graph_break [] 2025-12-04T09:42:50.2380625Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2381140Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2381647Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2381956Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2382218Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2382607Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2383608Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2384527Z graph_break [] 2025-12-04T09:42:50.2384782Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2385289Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2385790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2386143Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2386401Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2386795Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2387802Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2388669Z graph_break [] 2025-12-04T09:42:50.2388925Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2389421Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2389943Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2390292Z Traceback (most recent call last): 2025-12-04T09:42:50.2390751Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2391223Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2391672Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2392112Z ).run(bench_out) 2025-12-04T09:42:50.2392376Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2392673Z None UNK cufi44wmvf 2025-12-04T09:42:50.2393103Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2393748Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2394211Z ~~~~ <--- HERE 2025-12-04T09:42:50.2394665Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2395171Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2395301Z 2025-12-04T09:42:50.2395308Z 2025-12-04T09:42:50.2395461Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2396047Z 
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2396428Z 2025-12-04T09:42:50.2396610Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2397018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2397332Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2397589Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2397982Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2398979Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2399849Z graph_break [] 2025-12-04T09:42:50.2400166Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2400682Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2401188Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2401501Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2401757Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2402147Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2402921Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2403339Z graph_break [] 2025-12-04T09:42:50.2403466Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2403718Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2403963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2404116Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2404243Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2404434Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2404917Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2405337Z graph_break [] 2025-12-04T09:42:50.2405461Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2405712Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2405988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2406141Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2406269Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2406462Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2406969Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2407387Z graph_break [] 2025-12-04T09:42:50.2407513Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2407763Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2408013Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2408165Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2408291Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2408480Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2408960Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2409377Z graph_break [] 2025-12-04T09:42:50.2409505Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2409758Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2410037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2410190Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2410315Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2410506Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2410986Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2411408Z graph_break [] 2025-12-04T09:42:50.2411539Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2411789Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2412042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2412194Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2412321Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2412510Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2412989Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2413406Z graph_break [] 2025-12-04T09:42:50.2413536Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2413782Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2414025Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2414179Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2414307Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2414498Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2414989Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2415408Z graph_break [] 2025-12-04T09:42:50.2415537Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2415811Z Compiled module 
path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2416100Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2416257Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2416384Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2416575Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2417055Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2417477Z graph_break [] 2025-12-04T09:42:50.2417601Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2417853Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2418097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2418249Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2418417Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2418610Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2419092Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2419511Z graph_break [] 2025-12-04T09:42:50.2419636Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2419884Z Compiled module path: 
/tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2420129Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2420281Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2420408Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2420599Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2421078Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2421493Z graph_break [] 2025-12-04T09:42:50.2421620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2423819Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2424070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2424224Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2424351Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2424540Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2425022Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2425444Z graph_break [] 2025-12-04T09:42:50.2425568Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2425817Z Compiled module path: 
/tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2426174Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2426328Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2426457Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2426649Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2427135Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2427560Z graph_break [] 2025-12-04T09:42:50.2427690Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2427936Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2428182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2428345Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2428476Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2428670Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2429168Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2429591Z graph_break [] 2025-12-04T09:42:50.2429721Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2429978Z Compiled module path: 
/tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2430228Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2430383Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2430518Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2430711Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2431199Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2431619Z graph_break [] 2025-12-04T09:42:50.2431747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2432000Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2432252Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2432474Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2432608Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2432802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2433285Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2433710Z graph_break [] 2025-12-04T09:42:50.2433838Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2434090Z Compiled module path: 
/tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2434337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2434495Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2434623Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2434842Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2435319Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2435733Z graph_break [] 2025-12-04T09:42:50.2435856Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2436144Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2436392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2436547Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2436679Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2436874Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2437355Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2437787Z graph_break [] 2025-12-04T09:42:50.2437918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2438164Z Compiled module path: 
/tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2438405Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2438560Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2438691Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2438888Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2439370Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2439792Z graph_break [] 2025-12-04T09:42:50.2439920Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2440170Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2440430Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2440605Z Traceback (most recent call last): 2025-12-04T09:42:50.2440832Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2441088Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2441309Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2441528Z ).run(bench_out) 2025-12-04T09:42:50.2441658Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2441807Z None UNK cufi44wmvf 2025-12-04T09:42:50.2442018Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2442332Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, 
waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2442560Z ~~~~ <--- HERE 2025-12-04T09:42:50.2442785Z 0.010ms 0.000 GB 0.00GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2443006Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2443103Z 2025-12-04T09:42:50.2443105Z 2025-12-04T09:42:50.2443185Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2443449Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2443636Z 2025-12-04T09:42:50.2443725Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2443925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2444088Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2444217Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2444411Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2444903Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2445329Z graph_break [] 2025-12-04T09:42:50.2445457Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2445980Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2446228Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2446382Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2446513Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2446704Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2447188Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2447613Z graph_break [] 2025-12-04T09:42:50.2447744Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2447998Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2448247Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2448402Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2448530Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2448722Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2449201Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2449638Z graph_break [] 2025-12-04T09:42:50.2449767Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2450016Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2450268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2450425Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2450554Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2450746Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2451235Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2451657Z graph_break [] 2025-12-04T09:42:50.2451822Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2452071Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2452317Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2452470Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2452598Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2452791Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2453276Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2453695Z graph_break [] 2025-12-04T09:42:50.2453824Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2454079Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2454328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2454498Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2454627Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2454821Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2455301Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2455719Z graph_break [] 2025-12-04T09:42:50.2455847Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2456137Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2456385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2456542Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2456671Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2456861Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2457341Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2457759Z graph_break [] 2025-12-04T09:42:50.2457904Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2458150Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2458395Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2458548Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2458678Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2458871Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2459352Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2459772Z graph_break [] 2025-12-04T09:42:50.2459899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2460180Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2460424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2460579Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2460710Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2460905Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2461384Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2461805Z graph_break [] 2025-12-04T09:42:50.2461934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2462184Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2462433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2462588Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2462717Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2462926Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2463406Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2463826Z graph_break [] 2025-12-04T09:42:50.2463955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2464203Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2464453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2464608Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2464737Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2464933Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2465418Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2465835Z graph_break [] 2025-12-04T09:42:50.2466005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2466259Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2466524Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2466680Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2466808Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2466999Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2467479Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2467898Z graph_break [] 2025-12-04T09:42:50.2468027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2468277Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2468527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2468715Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2468844Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2469035Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2469517Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2469943Z graph_break [] 2025-12-04T09:42:50.2470076Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2470321Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2470566Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2470725Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2470863Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2471059Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2471541Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2471982Z graph_break [] 2025-12-04T09:42:50.2472114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2472372Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2472626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2472788Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2472922Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2473118Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2473604Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2474032Z graph_break [] 2025-12-04T09:42:50.2474166Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2474422Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2474677Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2474837Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2474984Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2475180Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2475664Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2476122Z graph_break [] 2025-12-04T09:42:50.2476256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2476513Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2476768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2476928Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2477055Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2477280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2477772Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2478200Z graph_break [] 2025-12-04T09:42:50.2478331Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2478586Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2478836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2478994Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2479121Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2479319Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2479805Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2480254Z graph_break [] 2025-12-04T09:42:50.2480386Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2480628Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2480875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2481030Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2481153Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2481349Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2481839Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2482264Z graph_break [] 2025-12-04T09:42:50.2482391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2482643Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2482892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2483050Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2483182Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2483376Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2483878Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:42:50.2484302Z graph_break []
2025-12-04T09:42:50.2484431Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:42:50.2484684Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py
2025-12-04T09:42:50.2484935Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:42:50.2485087Z frames [('total', 1), ('ok', 1)]
2025-12-04T09:42:50.2485214Z stats [('calls_captured', 3), ('unique_graphs', 1)]
2025-12-04T09:42:50.2485407Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:42:50.2485963Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:42:50.2486378Z graph_break []
2025-12-04T09:42:50.2486512Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:42:50.2486764Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py
2025-12-04T09:42:50.2487025Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________
2025-12-04T09:42:50.2487199Z Traceback (most recent call last):
2025-12-04T09:42:50.2487427Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark
2025-12-04T09:42:50.2487658Z self.verify_compiled_kernels()
2025-12-04T09:42:50.2487879Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels
2025-12-04T09:42:50.2488100Z ).run(bench_out)
2025-12-04T09:42:50.2488236Z RuntimeError: Expected to not find "GB/s" but found it
2025-12-04T09:42:50.2488384Z None UNK cufi44wmvf
2025-12-04T09:42:50.2488596Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2488931Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2489158Z ~~~~ <--- HERE
2025-12-04T09:42:50.2489382Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
2025-12-04T09:42:50.2489602Z From CHECK-NOT: GB/s
2025-12-04T09:42:50.2489667Z
2025-12-04T09:42:50.2489670Z
2025-12-04T09:42:50.2489746Z To execute this test, run the following from the base repo dir:
2025-12-04T09:42:50.2490010Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark
2025-12-04T09:42:50.2490197Z
2025-12-04T09:42:50.2490286Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:42:50.2490487Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:42:50.2490646Z frames [('total', 1), ('ok', 1)]
2025-12-04T09:42:50.2490774Z stats [('calls_captured', 3), ('unique_graphs', 1)]
2025-12-04T09:42:50.2490968Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:42:50.2491451Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:42:50.2491890Z graph_break []
2025-12-04T09:42:50.2492022Z
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2492277Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2492533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2492688Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2492818Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2493014Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2493498Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2493921Z graph_break [] 2025-12-04T09:42:50.2494079Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2494332Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2494411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2494455Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2494516Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2494616Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2494966Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2495003Z graph_break [] 2025-12-04T09:42:50.2495081Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2495220Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2495296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2495352Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2495412Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2495510Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2495856Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2495897Z graph_break [] 2025-12-04T09:42:50.2496022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2496166Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2496239Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2496285Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2496343Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2496445Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2496794Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2496833Z graph_break [] 2025-12-04T09:42:50.2496905Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2497066Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2497136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2497178Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2497231Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2497331Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2497677Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2497717Z graph_break [] 2025-12-04T09:42:50.2497788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2497927Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2498024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2498068Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2498122Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2498222Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2498565Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2498603Z graph_break [] 2025-12-04T09:42:50.2498677Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2498809Z Compiled module 
path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2498884Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2498925Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2498982Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2499078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2499440Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2499476Z graph_break [] 2025-12-04T09:42:50.2499551Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2499683Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2499758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2499800Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2499857Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2499953Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2500299Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2500335Z graph_break [] 2025-12-04T09:42:50.2500409Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2500540Z Compiled module path: 
/tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2500613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2500667Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2500726Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2500824Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2501173Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2501209Z graph_break [] 2025-12-04T09:42:50.2501283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2501414Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2501487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2501532Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2501609Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2501708Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2502052Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2502093Z graph_break [] 2025-12-04T09:42:50.2502165Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2502303Z Compiled module path: 
/tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2502374Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2502419Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2502474Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2502575Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2502921Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2502976Z graph_break [] 2025-12-04T09:42:50.2503047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2503186Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2503258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2503301Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2503358Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2503458Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2503800Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2503839Z graph_break [] 2025-12-04T09:42:50.2503910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2504041Z Compiled module path: 
/tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2504112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2504155Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2504209Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2504330Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2504674Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2504713Z graph_break [] 2025-12-04T09:42:50.2504788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2504925Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2505000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2505041Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2505099Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2505198Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2505573Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2505611Z graph_break [] 2025-12-04T09:42:50.2505686Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2505821Z Compiled module path: 
/tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2505895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2505985Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2506041Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2506137Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2506490Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2506546Z graph_break [] 2025-12-04T09:42:50.2506621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2506755Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2506828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2506870Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2506927Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2507023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2507378Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2507418Z graph_break [] 2025-12-04T09:42:50.2507490Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2507625Z Compiled module path: 
/tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2507697Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2507740Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2507794Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2507893Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2508238Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2508291Z graph_break [] 2025-12-04T09:42:50.2508368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2508499Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2508570Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2508613Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2508669Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2508768Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2509139Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2509179Z graph_break [] 2025-12-04T09:42:50.2509252Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2509390Z Compiled module path: 
/tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2509461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2509504Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2509557Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2509655Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2510000Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2519877Z graph_break [] 2025-12-04T09:42:50.2519962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2520150Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2520229Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2520276Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2520343Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2520451Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2520807Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2520847Z graph_break [] 2025-12-04T09:42:50.2520927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2521065Z Compiled module path: 
/tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2521142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2521183Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2521242Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2521341Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2521687Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2521746Z graph_break [] 2025-12-04T09:42:50.2521823Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2521959Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2522047Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2522095Z Traceback (most recent call last): 2025-12-04T09:42:50.2522248Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2522293Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2522437Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2522475Z ).run(bench_out) 2025-12-04T09:42:50.2522546Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2522592Z None UNK cufi44wmvf 2025-12-04T09:42:50.2522759Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2522906Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, 
waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2522959Z ~~~~ <--- HERE 2025-12-04T09:42:50.2523101Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2523143Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2523146Z 2025-12-04T09:42:50.2523148Z 2025-12-04T09:42:50.2523229Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2523380Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2523384Z 2025-12-04T09:42:50.2523477Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2523552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2523597Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2523668Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2523770Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2524115Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2524155Z graph_break [] 2025-12-04T09:42:50.2524228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2524370Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2524447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2524492Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2524547Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2524649Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2524997Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2525034Z graph_break [] 2025-12-04T09:42:50.2525106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2525243Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2525330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2525376Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2525430Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2525530Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2525877Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2525914Z graph_break [] 2025-12-04T09:42:50.2526049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2526184Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2526290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2526332Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2526388Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2526485Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2526832Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2526867Z graph_break [] 2025-12-04T09:42:50.2526942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2527077Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2527152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2527194Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2527250Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2527346Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2527717Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2527754Z graph_break [] 2025-12-04T09:42:50.2527828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2527963Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2528041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2528083Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2528144Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2528240Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2528586Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2528627Z graph_break [] 2025-12-04T09:42:50.2528701Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2528840Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2528912Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2528968Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2529024Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2529124Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2529466Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2529505Z graph_break [] 2025-12-04T09:42:50.2529576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2529707Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2529779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2529821Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2529877Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2529995Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2530338Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2530378Z graph_break [] 2025-12-04T09:42:50.2530450Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2530584Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2530655Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2530697Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2530749Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2530851Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2531196Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2531246Z graph_break [] 2025-12-04T09:42:50.2531318Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2531451Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2531524Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2531565Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2531620Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2531715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2532064Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2532101Z graph_break [] 2025-12-04T09:42:50.2532176Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2532309Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2532383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2532424Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2532479Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2532574Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2532930Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2532966Z graph_break [] 2025-12-04T09:42:50.2533040Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2533174Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2533247Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2533286Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2533340Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2533435Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2533825Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2533862Z graph_break [] 2025-12-04T09:42:50.2533935Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2534069Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2534141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2534182Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2534236Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2534334Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2534678Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2534714Z graph_break [] 2025-12-04T09:42:50.2534798Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2534926Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2534997Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2535038Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2535090Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2535189Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2535536Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2535572Z graph_break [] 2025-12-04T09:42:50.2535643Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2535782Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2535852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2535892Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2535983Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2536080Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2536422Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2536482Z graph_break [] 2025-12-04T09:42:50.2536552Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2536688Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2536758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2536799Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2536852Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2536948Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2537315Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2537352Z graph_break [] 2025-12-04T09:42:50.2537424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2537558Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2537631Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2537671Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2537725Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2537821Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2538164Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2538201Z graph_break [] 2025-12-04T09:42:50.2538273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2538405Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2538492Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2538533Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2538586Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2538681Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2539023Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2539058Z graph_break [] 2025-12-04T09:42:50.2539132Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2539260Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2539334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2539374Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2539428Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2539523Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2539868Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2539918Z graph_break [] 2025-12-04T09:42:50.2539990Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2540124Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2540194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2540237Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2540291Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2540387Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2540729Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2540765Z graph_break [] 2025-12-04T09:42:50.2540836Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2540994Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2541065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2541106Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2541158Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2541254Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2541595Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2541631Z graph_break [] 2025-12-04T09:42:50.2541702Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2541840Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2541910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2541951Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2542019Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2542116Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2542457Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2542494Z graph_break [] 2025-12-04T09:42:50.2542566Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2542701Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2542773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2542813Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2542870Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2542966Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2543309Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2543344Z graph_break [] 2025-12-04T09:42:50.2543416Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2543565Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2543651Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2543697Z Traceback (most recent call last): 2025-12-04T09:42:50.2543844Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2543889Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2544029Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in 
verify_compiled_kernels 2025-12-04T09:42:50.2544066Z ).run(bench_out) 2025-12-04T09:42:50.2544134Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2544177Z None UNK cufi44wmvf 2025-12-04T09:42:50.2544310Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2544472Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2544519Z ~~~~ <--- HERE 2025-12-04T09:42:50.2544658Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2544701Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2544703Z 2025-12-04T09:42:50.2544705Z 2025-12-04T09:42:50.2544780Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2544928Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2544931Z 2025-12-04T09:42:50.2545020Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2545093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2545136Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2545192Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2545292Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2545635Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2545687Z graph_break [] 2025-12-04T09:42:50.2545760Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2545899Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2546008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2546051Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2546105Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2546204Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2546546Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2546584Z graph_break [] 2025-12-04T09:42:50.2546656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2546791Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2546862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2546903Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2546971Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2547070Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2547413Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2547450Z graph_break [] 2025-12-04T09:42:50.2547521Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2547655Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2547728Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2547768Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2547822Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2547944Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2548289Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2548326Z graph_break [] 2025-12-04T09:42:50.2548398Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2548531Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2548602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2548642Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2548696Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2548793Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2549138Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2549189Z graph_break [] 2025-12-04T09:42:50.2549261Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2549396Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2549467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2549506Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2549560Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2549655Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2550003Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2550039Z graph_break [] 2025-12-04T09:42:50.2550111Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2550246Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2550317Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2550358Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2550412Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2550509Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2550863Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2550901Z graph_break [] 2025-12-04T09:42:50.2550972Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2551103Z Compiled module 
path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2551173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2551214Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2551266Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2551364Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2551748Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2551786Z graph_break [] 2025-12-04T09:42:50.2551858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2551993Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2552064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2552105Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2552158Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2552256Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2552601Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2552638Z graph_break [] 2025-12-04T09:42:50.2552710Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2552854Z Compiled module path: 
/tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2552924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2552965Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2553019Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2553115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2553460Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2553496Z graph_break [] 2025-12-04T09:42:50.2553569Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2553704Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2553776Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2553816Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2553871Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2553967Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2554312Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2554358Z graph_break [] 2025-12-04T09:42:50.2554430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2554564Z Compiled module path: 
/tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2554636Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2554675Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2554729Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2554824Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2555169Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2555205Z graph_break [] 2025-12-04T09:42:50.2555306Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2555440Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2555512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2555553Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2555607Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2555701Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2556093Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2556133Z graph_break [] 2025-12-04T09:42:50.2556207Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2556338Z Compiled module path: 
/tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2556428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2556472Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2556527Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2556630Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2556977Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2557018Z graph_break [] 2025-12-04T09:42:50.2557091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2557229Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2557301Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2557346Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2557398Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2557497Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2557841Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2557880Z graph_break [] 2025-12-04T09:42:50.2557967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2558107Z Compiled module path: 
/tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2558179Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2558224Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2558279Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2558378Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2558724Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2558762Z graph_break [] 2025-12-04T09:42:50.2558836Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2558997Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2559072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2559113Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2559172Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2559267Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2559613Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2559649Z graph_break [] 2025-12-04T09:42:50.2559723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2559859Z Compiled module path: 
/tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2559932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2559973Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2560033Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2560141Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2560488Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2560523Z graph_break [] 2025-12-04T09:42:50.2560598Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2560726Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2560803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2560844Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2560902Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2560999Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2561346Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2561384Z graph_break [] 2025-12-04T09:42:50.2561457Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2561593Z Compiled module path: 
/tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2561679Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2561722Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2561777Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2561876Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2562220Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2562259Z graph_break [] 2025-12-04T09:42:50.2562331Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2562464Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2562537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2562602Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2562657Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2562756Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2563105Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2563143Z graph_break [] 2025-12-04T09:42:50.2563214Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2563348Z Compiled module path: 
/tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2563419Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2563461Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2563515Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2563611Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2563968Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2564007Z graph_break [] 2025-12-04T09:42:50.2564078Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2564214Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2564288Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2564331Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2564389Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2564485Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2564829Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2564866Z graph_break [] 2025-12-04T09:42:50.2564943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2565079Z Compiled module path: 
/tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2565152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2565193Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2565262Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2565361Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2565708Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2565744Z graph_break [] 2025-12-04T09:42:50.2565819Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2565995Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2566086Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2566131Z Traceback (most recent call last): 2025-12-04T09:42:50.2566279Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2566352Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2566496Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2566535Z ).run(bench_out) 2025-12-04T09:42:50.2566605Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2566649Z None UNK cufi44wmvf 2025-12-04T09:42:50.2566782Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2566923Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, 
waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2566970Z ~~~~ <--- HERE 2025-12-04T09:42:50.2567112Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2567154Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2567156Z 2025-12-04T09:42:50.2567158Z 2025-12-04T09:42:50.2567234Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2567398Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2567400Z 2025-12-04T09:42:50.2567488Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2567561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2567605Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2567660Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2567761Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2568112Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2568151Z graph_break [] 2025-12-04T09:42:50.2568228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2568371Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2568443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2568486Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2568541Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2568642Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2568986Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2569046Z graph_break [] 2025-12-04T09:42:50.2569118Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2569260Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2569332Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2569376Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2569429Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2569528Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2569892Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2569929Z graph_break [] 2025-12-04T09:42:50.2570004Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2570140Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2570215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2570256Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2570311Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2570407Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2570752Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2570789Z graph_break [] 2025-12-04T09:42:50.2570862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2570996Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2571085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2571126Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2571183Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2571280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2571624Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2571663Z graph_break [] 2025-12-04T09:42:50.2571737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2571873Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2571949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2571990Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2572049Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2572145Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2572490Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2572540Z graph_break [] 2025-12-04T09:42:50.2572613Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2572750Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2572822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2572866Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2572919Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2573016Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2573360Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2573400Z graph_break [] 2025-12-04T09:42:50.2573491Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2573625Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2573697Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2573741Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2573795Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2573894Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2574240Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2574278Z graph_break [] 2025-12-04T09:42:50.2574351Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2574488Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2574561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2574618Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2574672Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2574770Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2575113Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2575150Z graph_break [] 2025-12-04T09:42:50.2575223Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2575358Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2575432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2575473Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2575530Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2575626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2576040Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2576076Z graph_break [] 2025-12-04T09:42:50.2576151Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2576302Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2576377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2576418Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2576480Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2576577Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2576926Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2576962Z graph_break [] 2025-12-04T09:42:50.2577036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2577172Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2577269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2577310Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2577367Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2577464Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2577807Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2577845Z graph_break [] 2025-12-04T09:42:50.2577915Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2578053Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2578129Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2578172Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2578227Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2578325Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2578684Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2578721Z graph_break [] 2025-12-04T09:42:50.2578794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2578925Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2578999Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2579047Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2579102Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2579201Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2579549Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2579587Z graph_break [] 2025-12-04T09:42:50.2579658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2579797Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2579880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2579926Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2579980Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2580078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2580422Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2580460Z graph_break [] 2025-12-04T09:42:50.2580532Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2580667Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2580741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2580784Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2580858Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2580955Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2581304Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2581341Z graph_break [] 2025-12-04T09:42:50.2581415Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2581551Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2581628Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2581670Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2581725Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2581823Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2582168Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2582224Z graph_break [] 2025-12-04T09:42:50.2582298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2582432Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2582504Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2582545Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2582603Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2582699Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2583045Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2583083Z graph_break [] 2025-12-04T09:42:50.2583158Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2583287Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2583361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2583402Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2583459Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2583570Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2583916Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2583956Z graph_break [] 2025-12-04T09:42:50.2584030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2584166Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2584240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2584285Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2584341Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2584441Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2584804Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2584844Z graph_break [] 2025-12-04T09:42:50.2584917Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2585053Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2585125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2585169Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2585222Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2585319Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2585668Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2585721Z graph_break [] 2025-12-04T09:42:50.2585793Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2585969Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2586041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2586085Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2586138Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2586237Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2586588Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2586626Z graph_break [] 2025-12-04T09:42:50.2586701Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2586836Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2586909Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2586951Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2587007Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2587105Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2587465Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2587502Z graph_break [] 2025-12-04T09:42:50.2587578Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2587712Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2587786Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2587826Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2587884Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2587980Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2588349Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2588387Z graph_break [] 2025-12-04T09:42:50.2588464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2588602Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2588680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2588721Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2588779Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2588875Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2589222Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2589262Z graph_break [] 2025-12-04T09:42:50.2589335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2589486Z Compiled module path: /tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2589571Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2589620Z Traceback (most recent call last): 2025-12-04T09:42:50.2589764Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2589812Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2589952Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in 
verify_compiled_kernels
2025-12-04T09:42:50.2589993Z ).run(bench_out)
2025-12-04T09:42:50.2590060Z RuntimeError: Expected to not find "GB/s" but found it
2025-12-04T09:42:50.2590107Z None UNK cufi44wmvf
2025-12-04T09:42:50.2590240Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2590384Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2590429Z ~~~~ <--- HERE
2025-12-04T09:42:50.2590572Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
2025-12-04T09:42:50.2590611Z From CHECK-NOT: GB/s
2025-12-04T09:42:50.2590613Z
2025-12-04T09:42:50.2590615Z
2025-12-04T09:42:50.2590692Z To execute this test, run the following from the base repo dir:
2025-12-04T09:42:50.2590851Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark
2025-12-04T09:42:50.2590853Z
2025-12-04T09:42:50.2590942Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:42:50.2591015Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:42:50.2591060Z frames [('total', 1), ('ok', 1)]
2025-12-04T09:42:50.2591116Z stats [('calls_captured', 3), ('unique_graphs', 1)]
2025-12-04T09:42:50.2591216Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T09:42:50.2591564Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T09:42:50.2591603Z graph_break []
2025-12-04T09:42:50.2591678Z
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2591834Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2591908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2591952Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2592009Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2592106Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2592451Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2592487Z graph_break [] 2025-12-04T09:42:50.2592560Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2592699Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2592776Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2592818Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2592890Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2592987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2593335Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2593372Z graph_break [] 2025-12-04T09:42:50.2593448Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2593584Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2593659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2593701Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2593758Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2593856Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2594201Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2594239Z graph_break [] 2025-12-04T09:42:50.2594312Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2594452Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2594538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2594581Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2594635Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2594734Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2595079Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2595119Z graph_break [] 2025-12-04T09:42:50.2595193Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2595331Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2595432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2595477Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2595530Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2595628Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2596020Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2596059Z graph_break [] 2025-12-04T09:42:50.2596131Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2596269Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2596342Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2596390Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2596444Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2596543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2596902Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2596941Z graph_break [] 2025-12-04T09:42:50.2597015Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2597148Z Compiled module 
path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2597222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2597264Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2597323Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2597420Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2597766Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2597803Z graph_break [] 2025-12-04T09:42:50.2597879Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2598011Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2598085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2598138Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2598197Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2598293Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2598639Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2598677Z graph_break [] 2025-12-04T09:42:50.2598751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2598885Z Compiled module path: 
/tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2598959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2599001Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2599057Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2599601Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2599950Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2599987Z graph_break [] 2025-12-04T09:42:50.2600062Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2600195Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2600269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2600311Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2600368Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2600466Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2600810Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2600861Z graph_break [] 2025-12-04T09:42:50.2600933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2601070Z Compiled module path: 
/tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2601142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2601186Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2601239Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2601340Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2601684Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2601726Z graph_break [] 2025-12-04T09:42:50.2601798Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2601936Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2602008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2602052Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2602106Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2602204Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2602562Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2602602Z graph_break [] 2025-12-04T09:42:50.2602674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2602804Z Compiled module path: 
/tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2602877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2602922Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2602976Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2603075Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2603444Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2603482Z graph_break [] 2025-12-04T09:42:50.2603556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2603692Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2603766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2603808Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2603865Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2603963Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2604312Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2604349Z graph_break [] 2025-12-04T09:42:50.2604437Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2604573Z Compiled module path: 
/tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2604647Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2604689Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2604743Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2604840Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2605185Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2605223Z graph_break [] 2025-12-04T09:42:50.2605298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2605434Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2605507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2605548Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2605605Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2605704Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2606087Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2606141Z graph_break [] 2025-12-04T09:42:50.2606213Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2606351Z Compiled module path: 
/tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2606423Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2606466Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2606520Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2606616Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2606986Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2607026Z graph_break [] 2025-12-04T09:42:50.2607099Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2607231Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2607303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2607347Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2607401Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2607499Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2607845Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2607888Z graph_break [] 2025-12-04T09:42:50.2607961Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2608097Z Compiled module path: 
/tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2608182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2608226Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2608281Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2608379Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2608724Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2608763Z graph_break [] 2025-12-04T09:42:50.2608841Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2608973Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2609048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2609089Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2609144Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2609240Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2609584Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2609631Z graph_break [] 2025-12-04T09:42:50.2609708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2609843Z Compiled module path: 
/tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2609917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2609961Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2610016Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2610112Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2610457Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2610494Z graph_break [] 2025-12-04T09:42:50.2610596Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2610730Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2610803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2610846Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2610903Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2611000Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2611346Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2611381Z graph_break [] 2025-12-04T09:42:50.2611455Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2611590Z Compiled module path: 
/tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2611664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2611719Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2611774Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2611873Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2612219Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2612257Z graph_break [] 2025-12-04T09:42:50.2612329Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2612469Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2612540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2612585Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2612640Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2612739Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2613082Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2613119Z graph_break [] 2025-12-04T09:42:50.2613190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2613339Z Compiled module path: 
/tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2613413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2613456Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2613511Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2613611Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2613953Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2613995Z graph_break [] 2025-12-04T09:42:50.2614065Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2614201Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2614314Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2614362Z Traceback (most recent call last): 2025-12-04T09:42:50.2614506Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2614555Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2614697Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2614735Z ).run(bench_out) 2025-12-04T09:42:50.2614805Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2614848Z None UNK cufi44wmvf 2025-12-04T09:42:50.2614981Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2615125Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, 
waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2615172Z ~~~~ <--- HERE 2025-12-04T09:42:50.2615312Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2615365Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2615367Z 2025-12-04T09:42:50.2615369Z 2025-12-04T09:42:50.2615443Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2615592Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2615594Z 2025-12-04T09:42:50.2615683Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2615758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2615800Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2615859Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2615997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2616347Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2616387Z graph_break [] 2025-12-04T09:42:50.2616463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2616600Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2616674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2616715Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2616785Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2616884Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2617230Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2617273Z graph_break [] 2025-12-04T09:42:50.2617344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2617481Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2617553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2617597Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2617651Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2617781Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2618120Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2618159Z graph_break [] 2025-12-04T09:42:50.2618233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2618368Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2618442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2618485Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2618539Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2618641Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2618984Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2619037Z graph_break [] 2025-12-04T09:42:50.2619109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2619246Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2619318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2619362Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2619416Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2619519Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2619865Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2619903Z graph_break [] 2025-12-04T09:42:50.2619976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2620111Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2620185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2620228Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2620282Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2620379Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2620739Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2620775Z graph_break [] 2025-12-04T09:42:50.2620849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2620984Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2621057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2621098Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2621153Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2621251Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2621618Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2621655Z graph_break [] 2025-12-04T09:42:50.2621729Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2621861Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2621938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2621977Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2622034Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2622130Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2622478Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2622517Z graph_break [] 2025-12-04T09:42:50.2622588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2622738Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2622811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2622851Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2622905Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2623004Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2623350Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2623391Z graph_break [] 2025-12-04T09:42:50.2623463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2623599Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2623670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2623714Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2623768Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2623866Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2624213Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2624272Z graph_break [] 2025-12-04T09:42:50.2624344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2624478Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2624550Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2624593Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2624647Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2624744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2625088Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2625150Z graph_break [] 2025-12-04T09:42:50.2625223Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2625359Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2625434Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2625475Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2625531Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2625626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2626011Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2626047Z graph_break [] 2025-12-04T09:42:50.2626123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2626257Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2626350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2626392Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2626448Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2626543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2626888Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2626924Z graph_break [] 2025-12-04T09:42:50.2626999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2627127Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2627201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2627243Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2627299Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2627396Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2627739Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2627775Z graph_break [] 2025-12-04T09:42:50.2627862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2628000Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2628076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2628119Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2628172Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2628270Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2628614Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2628651Z graph_break [] 2025-12-04T09:42:50.2628723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2628884Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2628954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2628997Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2629051Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2629149Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2629492Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2629530Z graph_break [] 2025-12-04T09:42:50.2629601Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2629743Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2629815Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2629857Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2629925Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2630023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2630366Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2630406Z graph_break [] 2025-12-04T09:42:50.2630478Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2630615Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2630689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2630734Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2630789Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2630888Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2631233Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2631269Z graph_break [] 2025-12-04T09:42:50.2631343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2631471Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2631556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2631597Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2631654Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2631751Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2632096Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2632132Z graph_break [] 2025-12-04T09:42:50.2632206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2632342Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2632418Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2632479Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2632537Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2632636Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2632982Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2633019Z graph_break [] 2025-12-04T09:42:50.2633095Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2633227Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2633304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2633348Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2633405Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2633502Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2633862Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2633900Z graph_break [] 2025-12-04T09:42:50.2633973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2634110Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2634182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2634227Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2634283Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2634382Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2634730Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2634773Z graph_break [] 2025-12-04T09:42:50.2634844Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2634983Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2635054Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2635108Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2635161Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2635262Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2635605Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2635644Z graph_break [] 2025-12-04T09:42:50.2635717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2635855Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2635963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2636008Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2636064Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2636193Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2636537Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2636577Z graph_break [] 2025-12-04T09:42:50.2636653Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2636789Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2636864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2636906Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2636963Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2637061Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2637407Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2637465Z graph_break [] 2025-12-04T09:42:50.2637540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2637674Z Compiled module path: /tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2637748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2637789Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2637847Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2637945Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2638296Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2638333Z graph_break [] 2025-12-04T09:42:50.2638409Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2638541Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2638616Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2638658Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2638716Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2638813Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2639177Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2639217Z graph_break [] 2025-12-04T09:42:50.2639290Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2639429Z Compiled module path: /tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2639514Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2639564Z Traceback (most recent call last): 2025-12-04T09:42:50.2639709Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2639757Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2639916Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in 
verify_compiled_kernels 2025-12-04T09:42:50.2639957Z ).run(bench_out) 2025-12-04T09:42:50.2640025Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2640071Z None UNK cufi44wmvf 2025-12-04T09:42:50.2640203Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2640349Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2640395Z ~~~~ <--- HERE 2025-12-04T09:42:50.2640536Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2640576Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2640578Z 2025-12-04T09:42:50.2640580Z 2025-12-04T09:42:50.2640660Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2640806Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2640820Z 2025-12-04T09:42:50.2640912Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2640986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2641032Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2641087Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2641188Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2641536Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2641578Z graph_break [] 2025-12-04T09:42:50.2641653Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2641790Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2641866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2641908Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2641966Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2642063Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2642410Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2642457Z graph_break [] 2025-12-04T09:42:50.2642533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2642670Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2642745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2642787Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2642846Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2642942Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2643287Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2643325Z graph_break [] 2025-12-04T09:42:50.2643421Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2643554Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2643630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2643672Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2643731Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2643829Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2644175Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2644213Z graph_break [] 2025-12-04T09:42:50.2644289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2644425Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2644500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2644557Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2644612Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2644713Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2645055Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2645095Z graph_break [] 2025-12-04T09:42:50.2645168Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2645311Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2645383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2645430Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2645485Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2645584Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2645966Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2646006Z graph_break [] 2025-12-04T09:42:50.2646080Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2646234Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2646306Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2646351Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2646408Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2646507Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2646851Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2646891Z graph_break [] 2025-12-04T09:42:50.2646963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2647098Z Compiled module 
path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2647196Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2647241Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2647296Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2647397Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2647742Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2647779Z graph_break [] 2025-12-04T09:42:50.2647854Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2647990Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2648067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2648109Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2648167Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2648279Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2648626Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2648663Z graph_break [] 2025-12-04T09:42:50.2648737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2648870Z Compiled module path: 
/tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2648948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2648990Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2649047Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2649144Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2649493Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2649530Z graph_break [] 2025-12-04T09:42:50.2649606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2649738Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2649824Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2649867Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2649924Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2650021Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2650371Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2650410Z graph_break [] 2025-12-04T09:42:50.2650482Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2650622Z Compiled module path: 
/tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2650694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2650739Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2650820Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2650919Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2651262Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2651304Z graph_break [] 2025-12-04T09:42:50.2651376Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2651515Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2651587Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2651632Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2651689Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2651788Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2652131Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2652183Z graph_break [] 2025-12-04T09:42:50.2652256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2652387Z Compiled module path: 
/tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2652460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2652506Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2652562Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2652665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2653007Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2653047Z graph_break [] 2025-12-04T09:42:50.2653121Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2653259Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2653336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2653378Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2653448Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2653546Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2653892Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2653930Z graph_break [] 2025-12-04T09:42:50.2654005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2654141Z Compiled module path: 
/tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2654216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2654258Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2654314Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2654412Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2654788Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2654827Z graph_break [] 2025-12-04T09:42:50.2654903Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2655039Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2655114Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2655155Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2655212Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2655309Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2655658Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2655709Z graph_break [] 2025-12-04T09:42:50.2655782Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2655918Z Compiled module path: 
/tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2656028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2656072Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2656127Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2656226Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2656571Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2656611Z graph_break [] 2025-12-04T09:42:50.2656684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2656816Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2656888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2656933Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2656989Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2657089Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2657447Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2657489Z graph_break [] 2025-12-04T09:42:50.2657562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2657700Z Compiled module path: 
/tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2657772Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2657816Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2657871Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2657970Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2658337Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2658378Z graph_break [] 2025-12-04T09:42:50.2658452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2658589Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2658663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2658705Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2658762Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2658859Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2659208Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2659246Z graph_break [] 2025-12-04T09:42:50.2659321Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2659472Z Compiled module path: 
/tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2659546Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2659863Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2659958Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2660066Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2660442Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2660490Z graph_break [] 2025-12-04T09:42:50.2660605Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2660766Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2660863Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2660932Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2660997Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2661125Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2661493Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2661568Z graph_break [] 2025-12-04T09:42:50.2661653Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2661812Z Compiled module path: 
/tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2661892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2661980Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2662045Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2662167Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2662523Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2662580Z graph_break [] 2025-12-04T09:42:50.2662703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2662870Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2662953Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2663020Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2663092Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2663218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2663591Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2663640Z graph_break [] 2025-12-04T09:42:50.2663743Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2663888Z Compiled module path: 
/tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2663992Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2664067Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2664148Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2664261Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2664634Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2664679Z graph_break [] 2025-12-04T09:42:50.2664794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2664941Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2665047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2665108Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2665182Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2665299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2665672Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2665728Z graph_break [] 2025-12-04T09:42:50.2665837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2666036Z Compiled module path: 
/tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2666141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2666226Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2666291Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2666412Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2666770Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2666843Z graph_break [] 2025-12-04T09:42:50.2666932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2667125Z Compiled module path: /tmp/tmpc1ytlvgs/5h/c5hn5kemane2up7uqe2zxrbxljjqlcs6xs5pq534ggoxivqtl7lo.py 2025-12-04T09:42:50.2667208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2667274Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2671109Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2671218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2671569Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2671606Z graph_break [] 2025-12-04T09:42:50.2671683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2671825Z Compiled module path: 
/tmp/tmppmrqdeoa/q3/cq3a3kewslexzbdfbcrw5vv2uawnhkcmfdmbivorym7hva7r73e2.py 2025-12-04T09:42:50.2671908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2671950Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2672005Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2672136Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2672483Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2672522Z graph_break [] 2025-12-04T09:42:50.2672595Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2672734Z Compiled module path: /tmp/tmpiak5v0ji/4z/c4zqz2p3nu7ar3n2w4ozmsk7gtbhaesxtbang4ezw7yr2t3exgb3.py 2025-12-04T09:42:50.2672810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2672855Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2672910Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2673012Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2673355Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2673391Z graph_break [] 2025-12-04T09:42:50.2673463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2673596Z Compiled module path: 
/tmp/tmputb90ews/ca/cca5udy25zmcg27vjmf36l42kvfmesf54ipj7i2begdb3lers2zw.py 2025-12-04T09:42:50.2673695Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2673745Z Traceback (most recent call last): 2025-12-04T09:42:50.2673895Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2673944Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2674084Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2674124Z ).run(bench_out) 2025-12-04T09:42:50.2674192Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2674236Z None UNK cufi44wmvf 2025-12-04T09:42:50.2674373Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2674521Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2675425Z ~~~~ <--- HERE 2025-12-04T09:42:50.2675568Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2675611Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2675613Z 2025-12-04T09:42:50.2675616Z 2025-12-04T09:42:50.2675695Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2675845Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2675849Z 2025-12-04T09:42:50.2675989Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2676064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2676105Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2676164Z stats 
[('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2676268Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2676620Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2676677Z graph_break [] 2025-12-04T09:42:50.2676752Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2676889Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2676961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2677002Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2677058Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2677157Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2677500Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2677539Z graph_break [] 2025-12-04T09:42:50.2677613Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2677747Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2677820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2677862Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2677916Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2678029Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2678377Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2678416Z graph_break [] 2025-12-04T09:42:50.2678491Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2678626Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2678701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2678743Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2678798Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2678896Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2679279Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2679318Z graph_break [] 2025-12-04T09:42:50.2679391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2679528Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2679598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2679639Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2679693Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2679793Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2680141Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2680192Z graph_break [] 2025-12-04T09:42:50.2680265Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2680401Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2680472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2680514Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2680569Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2680669Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2681014Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2681052Z graph_break [] 2025-12-04T09:42:50.2681126Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2681260Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2681332Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2681376Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2681429Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2681529Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2681876Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2681923Z graph_break [] 2025-12-04T09:42:50.2681997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2682130Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2682201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2682242Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2682296Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2682394Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2682756Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2682794Z graph_break [] 2025-12-04T09:42:50.2682867Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2683000Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2683073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2683114Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2683169Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2683265Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2683611Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2683647Z graph_break [] 2025-12-04T09:42:50.2683723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2683855Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2683940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2683982Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2684037Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2684132Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2684476Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2684514Z graph_break [] 2025-12-04T09:42:50.2684586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2684720Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2684793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2684834Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2684887Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2684984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2685326Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2685375Z graph_break [] 2025-12-04T09:42:50.2685449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2685585Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2685656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2685697Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2685750Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2685846Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2686230Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2686267Z graph_break [] 2025-12-04T09:42:50.2686367Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2686502Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2686574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2686617Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2686670Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2686767Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2687114Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2687151Z graph_break [] 2025-12-04T09:42:50.2687224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2687354Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2687426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2687483Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2687537Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2687631Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2687973Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2688008Z graph_break [] 2025-12-04T09:42:50.2688080Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2688219Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2688292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2688332Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2688388Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2688483Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2688826Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2688862Z graph_break [] 2025-12-04T09:42:50.2688934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2689080Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2689153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2689193Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2689247Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2689344Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2689689Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2689725Z graph_break [] 2025-12-04T09:42:50.2689796Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2689932Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2690031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2690074Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2690127Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2690225Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2690565Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2690602Z graph_break [] 2025-12-04T09:42:50.2690673Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2690806Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2690879Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2690920Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2690974Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2691071Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2691431Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2691468Z graph_break [] 2025-12-04T09:42:50.2691540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2691669Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2691741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2691784Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2691837Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2691935Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2692278Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2692314Z graph_break [] 2025-12-04T09:42:50.2692385Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2692519Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2692591Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2692641Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2692697Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2692793Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2693134Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2693169Z graph_break [] 2025-12-04T09:42:50.2693242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2693373Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2693445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2693486Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2693560Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2693656Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2693999Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2694036Z graph_break [] 2025-12-04T09:42:50.2694109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2694243Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2694315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2694355Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2694410Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2694507Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2694850Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2694953Z graph_break [] 2025-12-04T09:42:50.2695027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2695160Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2695232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2695272Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2695328Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2695425Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2695771Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2695808Z graph_break [] 2025-12-04T09:42:50.2695879Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2696052Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2696123Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2696165Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2696219Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2696332Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2696673Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2696711Z graph_break [] 2025-12-04T09:42:50.2696782Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2696917Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2696988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2697029Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2697082Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2697180Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2697546Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2697584Z graph_break [] 2025-12-04T09:42:50.2697656Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2697791Z Compiled module path: /tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2697862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2697903Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2697956Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2698054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2698398Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2698449Z graph_break [] 2025-12-04T09:42:50.2698521Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2698652Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2698725Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2698766Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2698820Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2698915Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2699262Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2699298Z graph_break [] 2025-12-04T09:42:50.2699371Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2699506Z Compiled module path: /tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2699579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2699620Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2699674Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2699770Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2700116Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2700162Z graph_break [] 2025-12-04T09:42:50.2700235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2700369Z Compiled module path: /tmp/tmpc1ytlvgs/5h/c5hn5kemane2up7uqe2zxrbxljjqlcs6xs5pq534ggoxivqtl7lo.py 2025-12-04T09:42:50.2700442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2700482Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2700536Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2700632Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2700998Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2701036Z graph_break [] 2025-12-04T09:42:50.2701108Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2701245Z Compiled module path: /tmp/tmppmrqdeoa/q3/cq3a3kewslexzbdfbcrw5vv2uawnhkcmfdmbivorym7hva7r73e2.py 2025-12-04T09:42:50.2701316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2701357Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2701411Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2701507Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2701852Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2701889Z graph_break [] 2025-12-04T09:42:50.2701961Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2702111Z Compiled module path: /tmp/tmpiak5v0ji/4z/c4zqz2p3nu7ar3n2w4ozmsk7gtbhaesxtbang4ezw7yr2t3exgb3.py 2025-12-04T09:42:50.2702183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2702224Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2702277Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2702374Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2702719Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2702758Z graph_break [] 2025-12-04T09:42:50.2702829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2702961Z Compiled module path: /tmp/tmputb90ews/ca/cca5udy25zmcg27vjmf36l42kvfmesf54ipj7i2begdb3lers2zw.py 2025-12-04T09:42:50.2703035Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2703076Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2703129Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2703225Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2703566Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2703620Z graph_break [] 2025-12-04T09:42:50.2703694Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2703829Z Compiled module path: /tmp/tmpyz8phy1t/jn/cjnln5dtyols5eyqvhdeqcoweuntrzgbftw3xboyqn3izxynbtun.py 2025-12-04T09:42:50.2703903Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2703945Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2704001Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2704097Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2704439Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2704475Z graph_break [] 2025-12-04T09:42:50.2704571Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2704704Z Compiled module path: /tmp/tmpbbugkt6v/3z/c3zgvjwsa76qdup6wal5bvf3ccdero3a2m5biswvmdy4zaciigj7.py 2025-12-04T09:42:50.2704777Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2704818Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2704873Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2704968Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2705312Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2705348Z graph_break [] 2025-12-04T09:42:50.2705421Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2705555Z Compiled module path: /tmp/tmpb8t1ft7q/sv/csvnn6axisvhezh36eq2emnmssqzcqy6lwvcn6uxfmhxldxx6rku.py 2025-12-04T09:42:50.2705640Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2705700Z Traceback (most recent call last): 2025-12-04T09:42:50.2705847Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2705891Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2706074Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in 
verify_compiled_kernels 2025-12-04T09:42:50.2706110Z ).run(bench_out) 2025-12-04T09:42:50.2706179Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2706222Z None UNK cufi44wmvf 2025-12-04T09:42:50.2706357Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2706501Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2706548Z ~~~~ <--- HERE 2025-12-04T09:42:50.2706687Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2706725Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2706727Z 2025-12-04T09:42:50.2706729Z 2025-12-04T09:42:50.2706806Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2706954Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2706956Z 2025-12-04T09:42:50.2707044Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2707134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2707177Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2707232Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2707334Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2707677Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2707714Z graph_break [] 2025-12-04T09:42:50.2707786Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2707924Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2708024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2708066Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2708122Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2708220Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2708566Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2708603Z graph_break [] 2025-12-04T09:42:50.2708677Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2708812Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2708885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2708927Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2708983Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2709078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2709435Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2709470Z graph_break [] 2025-12-04T09:42:50.2709543Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2709676Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2709749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2709791Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2709848Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2709943Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2710286Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2710323Z graph_break [] 2025-12-04T09:42:50.2710397Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2710530Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2710602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2710653Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2710711Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2710806Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2711153Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2711191Z graph_break [] 2025-12-04T09:42:50.2711265Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2711401Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2711474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2711516Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2711571Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2711685Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2712027Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2712065Z graph_break [] 2025-12-04T09:42:50.2712136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2712273Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2712344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2712385Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2712439Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2712542Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2712887Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2712936Z graph_break [] 2025-12-04T09:42:50.2713008Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2713141Z Compiled module 
path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2713212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2713254Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2713307Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2713406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2713749Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2713788Z graph_break [] 2025-12-04T09:42:50.2713861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2713995Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2714066Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2714108Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2714162Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2714260Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2714618Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2714655Z graph_break [] 2025-12-04T09:42:50.2714729Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2714862Z Compiled module path: 
/tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2714934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2714974Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2715029Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2715125Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2715489Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2715526Z graph_break [] 2025-12-04T09:42:50.2715599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2715732Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2715804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2715845Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2715900Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2716039Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2716385Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2716421Z graph_break [] 2025-12-04T09:42:50.2716518Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2716653Z Compiled module path: 
/tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2716727Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2716767Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2716822Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2716917Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2717263Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2717301Z graph_break [] 2025-12-04T09:42:50.2717373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2717510Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2717581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2717622Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2717676Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2717773Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2718117Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2718178Z graph_break [] 2025-12-04T09:42:50.2718249Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2718379Z Compiled module path: 
/tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2718450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2718493Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2718546Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2718644Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2719014Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2719052Z graph_break [] 2025-12-04T09:42:50.2719124Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2719262Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2719336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2719379Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2719432Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2719529Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2719869Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2719908Z graph_break [] 2025-12-04T09:42:50.2719981Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2720115Z Compiled module path: 
/tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2720202Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2720242Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2720298Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2720393Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2720740Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2720778Z graph_break [] 2025-12-04T09:42:50.2720851Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2720986Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2721060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2721100Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2721155Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2721251Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2721591Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2721637Z graph_break [] 2025-12-04T09:42:50.2721712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2721844Z Compiled module path: 
/tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2721916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2721957Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2722012Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2722108Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2722451Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2722487Z graph_break [] 2025-12-04T09:42:50.2722561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2722709Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2722781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2722825Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2722879Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2722976Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2723315Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2723351Z graph_break [] 2025-12-04T09:42:50.2723423Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2723559Z Compiled module path: 
/tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2723630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2723672Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2723739Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2723836Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2724178Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2724215Z graph_break [] 2025-12-04T09:42:50.2724287Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2724423Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2724495Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2724537Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2724592Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2724692Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2725039Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2725076Z graph_break [] 2025-12-04T09:42:50.2725147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2725294Z Compiled module path: 
/tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2725367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2725408Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2725463Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2725560Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2725903Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2725980Z graph_break [] 2025-12-04T09:42:50.2726053Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2726187Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2726286Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2726326Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2726381Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2726477Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2726822Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2726858Z graph_break [] 2025-12-04T09:42:50.2726933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2727068Z Compiled module path: 
/tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2727141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2727183Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2727238Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2727334Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2727693Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2727727Z graph_break [] 2025-12-04T09:42:50.2727801Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2727934Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2728007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2728050Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2728105Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2728200Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2728545Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2728582Z graph_break [] 2025-12-04T09:42:50.2728654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2728789Z Compiled module path: 
/tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2728861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2728914Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2728970Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2729066Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2729412Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2729450Z graph_break [] 2025-12-04T09:42:50.2729521Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2729652Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2729724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2729767Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2729820Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2729946Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2730288Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2730326Z graph_break [] 2025-12-04T09:42:50.2730396Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2730534Z Compiled module path: 
/tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2730606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2730648Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2730703Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2730801Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2731144Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2731192Z graph_break [] 2025-12-04T09:42:50.2731264Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2731397Z Compiled module path: /tmp/tmpc1ytlvgs/5h/c5hn5kemane2up7uqe2zxrbxljjqlcs6xs5pq534ggoxivqtl7lo.py 2025-12-04T09:42:50.2731469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2731510Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2731564Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2731662Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2732004Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2732041Z graph_break [] 2025-12-04T09:42:50.2732114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2732249Z Compiled module path: 
/tmp/tmppmrqdeoa/q3/cq3a3kewslexzbdfbcrw5vv2uawnhkcmfdmbivorym7hva7r73e2.py 2025-12-04T09:42:50.2732321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2732361Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2732415Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2732522Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2732868Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2732904Z graph_break [] 2025-12-04T09:42:50.2732977Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2733108Z Compiled module path: /tmp/tmpiak5v0ji/4z/c4zqz2p3nu7ar3n2w4ozmsk7gtbhaesxtbang4ezw7yr2t3exgb3.py 2025-12-04T09:42:50.2733182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2733222Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2733277Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2733372Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2733811Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2733849Z graph_break [] 2025-12-04T09:42:50.2733921Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2734053Z Compiled module path: 
/tmp/tmputb90ews/ca/cca5udy25zmcg27vjmf36l42kvfmesf54ipj7i2begdb3lers2zw.py 2025-12-04T09:42:50.2734124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2734165Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2734218Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2734314Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2734659Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2734712Z graph_break [] 2025-12-04T09:42:50.2734783Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2734919Z Compiled module path: /tmp/tmpyz8phy1t/jn/cjnln5dtyols5eyqvhdeqcoweuntrzgbftw3xboyqn3izxynbtun.py 2025-12-04T09:42:50.2734990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2735032Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2735085Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2735181Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2735529Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2735567Z graph_break [] 2025-12-04T09:42:50.2735640Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2735774Z Compiled module path: 
/tmp/tmpbbugkt6v/3z/c3zgvjwsa76qdup6wal5bvf3ccdero3a2m5biswvmdy4zaciigj7.py 2025-12-04T09:42:50.2735845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2735887Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2735988Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2736085Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2736429Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2736480Z graph_break [] 2025-12-04T09:42:50.2736553Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2736689Z Compiled module path: /tmp/tmpb8t1ft7q/sv/csvnn6axisvhezh36eq2emnmssqzcqy6lwvcn6uxfmhxldxx6rku.py 2025-12-04T09:42:50.2736761Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2736801Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2736856Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2736952Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2737322Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2737359Z graph_break [] 2025-12-04T09:42:50.2737432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2737565Z Compiled module path: 
/tmp/tmpuake4uz1/is/cis2aq453bnnv4wrn5apdfizeefau53c7ytdpi7m4y7hvo6z3dck.py
2025-12-04T09:42:50.2737650Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________
2025-12-04T09:42:50.2737696Z Traceback (most recent call last):
2025-12-04T09:42:50.2737840Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark
2025-12-04T09:42:50.2737884Z self.verify_compiled_kernels()
2025-12-04T09:42:50.2738024Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels
2025-12-04T09:42:50.2738060Z ).run(bench_out)
2025-12-04T09:42:50.2738129Z RuntimeError: Expected to not find "GB/s" but found it
2025-12-04T09:42:50.2738176Z None UNK cufi44wmvf
2025-12-04T09:42:50.2738309Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2738464Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2738511Z ~~~~ <--- HERE
2025-12-04T09:42:50.2738649Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
2025-12-04T09:42:50.2738689Z From CHECK-NOT: GB/s
2025-12-04T09:42:50.2738691Z
2025-12-04T09:42:50.2738693Z
2025-12-04T09:42:50.2738768Z To execute this test, run the following from the base repo dir:
2025-12-04T09:42:50.2738917Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark
2025-12-04T09:42:50.2738919Z
2025-12-04T09:42:50.2739007Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:42:50.2739080Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:42:50.2739124Z frames [('total', 1), ('ok', 1)]
2025-12-04T09:42:50.2739178Z stats
[('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2739276Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2739622Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2739659Z graph_break [] 2025-12-04T09:42:50.2739741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2739882Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2739954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2739999Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2740053Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2740151Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2740494Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2740531Z graph_break [] 2025-12-04T09:42:50.2740603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2740759Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2740831Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2740873Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2740927Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2741024Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2741368Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2741404Z graph_break [] 2025-12-04T09:42:50.2741475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2741611Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2741684Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2741725Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2741779Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2741891Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2742234Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2742270Z graph_break [] 2025-12-04T09:42:50.2742344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2742478Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2742553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2742594Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2742648Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2742746Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2743088Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2743124Z graph_break [] 2025-12-04T09:42:50.2743196Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2743331Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2743417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2743458Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2743513Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2743608Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2743953Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2743988Z graph_break [] 2025-12-04T09:42:50.2744061Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2744196Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2744270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2744347Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2744404Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2744500Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2744846Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2744884Z graph_break [] 2025-12-04T09:42:50.2744955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2745088Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2745160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2745206Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2745262Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2745361Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2745716Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2745754Z graph_break [] 2025-12-04T09:42:50.2745825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2746019Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2746092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2746134Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2746190Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2746290Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2746636Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2746675Z graph_break [] 2025-12-04T09:42:50.2746747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2746882Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2746953Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2746997Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2747069Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2747171Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2747517Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2747554Z graph_break [] 2025-12-04T09:42:50.2747629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2747762Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2747832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2747874Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2747930Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2748056Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2748400Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2748437Z graph_break [] 2025-12-04T09:42:50.2748510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2748645Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2748719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2748759Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2748815Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2748917Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2749263Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2749315Z graph_break [] 2025-12-04T09:42:50.2749388Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2749522Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2749593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2749634Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2749691Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2749787Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2750133Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2750172Z graph_break [] 2025-12-04T09:42:50.2750244Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2750372Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2750445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2750486Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2750539Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2750636Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2750993Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2751033Z graph_break [] 2025-12-04T09:42:50.2751105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2751243Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2751314Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2751355Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2751408Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2751504Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2751871Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2751908Z graph_break [] 2025-12-04T09:42:50.2751981Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2752118Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2752189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2752231Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2752284Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2752383Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2752730Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2752769Z graph_break [] 2025-12-04T09:42:50.2752841Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2752988Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2753062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2753103Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2753159Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2753255Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2753601Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2753637Z graph_break [] 2025-12-04T09:42:50.2753710Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2753845Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2753918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2753958Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2754013Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2754108Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2754453Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2754501Z graph_break [] 2025-12-04T09:42:50.2754576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2754707Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2754780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2754821Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2754875Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2754971Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2755314Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2755369Z graph_break [] 2025-12-04T09:42:50.2755442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2755576Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2755649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2755690Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2755743Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2755839Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2756222Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2756261Z graph_break [] 2025-12-04T09:42:50.2756335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2756468Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2756560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2756601Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2756655Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2756752Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2757103Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2757142Z graph_break [] 2025-12-04T09:42:50.2757216Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2757351Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2757424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2757469Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2757522Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2757621Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2757963Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2758023Z graph_break [] 2025-12-04T09:42:50.2758098Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2758232Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2758303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2758345Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2758398Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2758498Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2758843Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2758878Z graph_break [] 2025-12-04T09:42:50.2758953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2759113Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2759187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2759230Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2759285Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2759381Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2759726Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2759761Z graph_break [] 2025-12-04T09:42:50.2759833Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2759971Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2760045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2760086Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2760156Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2760252Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2760598Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2760633Z graph_break [] 2025-12-04T09:42:50.2760708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2760842Z Compiled module path: /tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2760915Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2760955Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2761011Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2761109Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2761457Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2761495Z graph_break [] 2025-12-04T09:42:50.2761566Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2761699Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2761784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2761826Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2761880Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2761979Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2762322Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2762359Z graph_break [] 2025-12-04T09:42:50.2762431Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2762567Z Compiled module path: /tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2762659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2762702Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2762755Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2762851Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2763195Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2763231Z graph_break [] 2025-12-04T09:42:50.2763302Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2763440Z Compiled module path: /tmp/tmpc1ytlvgs/5h/c5hn5kemane2up7uqe2zxrbxljjqlcs6xs5pq534ggoxivqtl7lo.py 2025-12-04T09:42:50.2763513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2763557Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2763611Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2763710Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2764063Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2764100Z graph_break [] 2025-12-04T09:42:50.2764173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2764309Z Compiled module path: /tmp/tmppmrqdeoa/q3/cq3a3kewslexzbdfbcrw5vv2uawnhkcmfdmbivorym7hva7r73e2.py 2025-12-04T09:42:50.2764382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2764423Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2764480Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2764576Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2764921Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2764956Z graph_break [] 2025-12-04T09:42:50.2765032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2765163Z Compiled module path: /tmp/tmpiak5v0ji/4z/c4zqz2p3nu7ar3n2w4ozmsk7gtbhaesxtbang4ezw7yr2t3exgb3.py 2025-12-04T09:42:50.2765237Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2765287Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2765346Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2765443Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2765787Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2765823Z graph_break [] 2025-12-04T09:42:50.2765896Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2766067Z Compiled module path: /tmp/tmputb90ews/ca/cca5udy25zmcg27vjmf36l42kvfmesf54ipj7i2begdb3lers2zw.py 2025-12-04T09:42:50.2766140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2766179Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2766235Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2766354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2766697Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2766735Z graph_break [] 2025-12-04T09:42:50.2766808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2766945Z Compiled module path: /tmp/tmpyz8phy1t/jn/cjnln5dtyols5eyqvhdeqcoweuntrzgbftw3xboyqn3izxynbtun.py 2025-12-04T09:42:50.2767015Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2767058Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2767113Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2767211Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2767556Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2767610Z graph_break [] 2025-12-04T09:42:50.2767682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2767817Z Compiled module path: /tmp/tmpbbugkt6v/3z/c3zgvjwsa76qdup6wal5bvf3ccdero3a2m5biswvmdy4zaciigj7.py 2025-12-04T09:42:50.2767887Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2767930Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2767985Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2768084Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2768425Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2768462Z graph_break [] 2025-12-04T09:42:50.2768533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2768668Z Compiled module path: /tmp/tmpb8t1ft7q/sv/csvnn6axisvhezh36eq2emnmssqzcqy6lwvcn6uxfmhxldxx6rku.py 2025-12-04T09:42:50.2768738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2768781Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2768835Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2768933Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2769293Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2769332Z graph_break [] 2025-12-04T09:42:50.2769403Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2769538Z Compiled module path: /tmp/tmpuake4uz1/is/cis2aq453bnnv4wrn5apdfizeefau53c7ytdpi7m4y7hvo6z3dck.py 2025-12-04T09:42:50.2769611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2769651Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2769707Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2769804Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2770177Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2770213Z graph_break [] 2025-12-04T09:42:50.2770287Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2770418Z Compiled module path: /tmp/tmp_5sp4e1b/nj/cnjjvf2hskj64abhf3tcw6wgkrtmet7smjrbe6zg4om5nzlujwzg.py 2025-12-04T09:42:50.2770490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2770532Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2770588Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2770683Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2771030Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2771065Z graph_break [] 2025-12-04T09:42:50.2771161Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2771296Z Compiled module path: /tmp/tmpgrsri9x0/fw/cfwpxxh62dgwp44s2won22nlpb5vbdsypcxk22gyrjqo6hkmvro2.py 2025-12-04T09:42:50.2771383Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________ 2025-12-04T09:42:50.2771429Z Traceback (most recent call last): 2025-12-04T09:42:50.2771576Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark 2025-12-04T09:42:50.2771621Z self.verify_compiled_kernels() 2025-12-04T09:42:50.2771764Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels 2025-12-04T09:42:50.2771801Z ).run(bench_out) 2025-12-04T09:42:50.2771870Z RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2771913Z None UNK cufi44wmvf 2025-12-04T09:42:50.2772046Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2772188Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2772236Z ~~~~ <--- HERE 2025-12-04T09:42:50.2772377Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2772415Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2772417Z 2025-12-04T09:42:50.2772429Z 2025-12-04T09:42:50.2772507Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2772656Z 
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2772658Z 2025-12-04T09:42:50.2772746Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2772820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2772864Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2772919Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2773022Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2773367Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2773405Z graph_break [] 2025-12-04T09:42:50.2773499Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2773639Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2773711Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2773753Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2773806Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2773905Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2774247Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2774287Z graph_break [] 2025-12-04T09:42:50.2774362Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2774500Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2774571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2774626Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2774679Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2774779Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2775123Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2775160Z graph_break [] 2025-12-04T09:42:50.2775233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2775367Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2775440Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2775483Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2775538Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2775635Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2776026Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2776062Z graph_break [] 2025-12-04T09:42:50.2776134Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:42:50.2776284Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2776357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2776397Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2776456Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2776552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2776902Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2776937Z graph_break [] 2025-12-04T09:42:50.2777012Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2777172Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2777246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2777287Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2777341Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2777438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2777780Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2777816Z graph_break [] 2025-12-04T09:42:50.2777888Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:42:50.2778026Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2778099Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2778141Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2778193Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2778305Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2778649Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2778685Z graph_break [] 2025-12-04T09:42:50.2778758Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2778892Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2778967Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2779008Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2779061Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2779158Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2779507Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2779545Z graph_break [] 2025-12-04T09:42:50.2779617Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2779751Z Compiled module 
path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2779833Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2779877Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2779931Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2780028Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2780377Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2780414Z graph_break [] 2025-12-04T09:42:50.2780486Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2780619Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2780691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2780735Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2780808Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2780906Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2781253Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2781290Z graph_break [] 2025-12-04T09:42:50.2781365Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2781497Z Compiled module path: 
/tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2781571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2781615Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2781674Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2781770Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2782117Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2782167Z graph_break [] 2025-12-04T09:42:50.2782240Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2782375Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2782447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2782490Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2782545Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2782643Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2782986Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2783027Z graph_break [] 2025-12-04T09:42:50.2783099Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2783236Z Compiled module path: 
/tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2783307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2783351Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2783416Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2783516Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2783859Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2783899Z graph_break [] 2025-12-04T09:42:50.2783971Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2784101Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2784172Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2784214Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2784268Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2784393Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2784737Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2784777Z graph_break [] 2025-12-04T09:42:50.2784849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2784986Z Compiled module path: 
/tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2785060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2785102Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2785156Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2785255Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2785596Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2785643Z graph_break [] 2025-12-04T09:42:50.2785714Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2785853Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2785986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2786029Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2786085Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2786182Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2786529Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2786567Z graph_break [] 2025-12-04T09:42:50.2786640Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2786777Z Compiled module path: 
/tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2786849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2786890Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2786946Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2787042Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2787408Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2787447Z graph_break [] 2025-12-04T09:42:50.2787522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2787654Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2787727Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2787767Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2787823Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2787919Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2788300Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2788338Z graph_break [] 2025-12-04T09:42:50.2788413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2788540Z Compiled module path: 
/tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2788614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2788654Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2788711Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2788805Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2789151Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2789193Z graph_break [] 2025-12-04T09:42:50.2789266Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2789460Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2789540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2789584Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2789637Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2789737Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2790082Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2790119Z graph_break [] 2025-12-04T09:42:50.2790191Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2790327Z Compiled module path: 
/tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2790399Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2790463Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2790522Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2790621Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2790966Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2791016Z graph_break [] 2025-12-04T09:42:50.2791089Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2791225Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2791298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2791341Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2791395Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2791494Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2791844Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2791902Z graph_break [] 2025-12-04T09:42:50.2791976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2792111Z Compiled module path: 
/tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2792186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2792227Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2792281Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2792376Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2792721Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2792759Z graph_break [] 2025-12-04T09:42:50.2792834Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2792968Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2793055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2793097Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2793152Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2793248Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2793593Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2793631Z graph_break [] 2025-12-04T09:42:50.2793707Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2793841Z Compiled module path: 
/tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2793915Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2793958Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2794015Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2794112Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2794461Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2794499Z graph_break [] 2025-12-04T09:42:50.2794583Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2794720Z Compiled module path: /tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2794791Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2794833Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2794886Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2794984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2795325Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2795363Z graph_break [] 2025-12-04T09:42:50.2795433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2795586Z Compiled module path: 
/tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2795657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2795700Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2795754Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2795854Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2796244Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2796283Z graph_break [] 2025-12-04T09:42:50.2796355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2796495Z Compiled module path: /tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2796566Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2796609Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2796663Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2796786Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2797129Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2797168Z graph_break [] 2025-12-04T09:42:50.2797242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2797376Z Compiled module path: 
/tmp/tmpc1ytlvgs/5h/c5hn5kemane2up7uqe2zxrbxljjqlcs6xs5pq534ggoxivqtl7lo.py 2025-12-04T09:42:50.2797453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2797494Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2797551Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2797648Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2797991Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2798027Z graph_break [] 2025-12-04T09:42:50.2798103Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2798240Z Compiled module path: /tmp/tmppmrqdeoa/q3/cq3a3kewslexzbdfbcrw5vv2uawnhkcmfdmbivorym7hva7r73e2.py 2025-12-04T09:42:50.2798331Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2798372Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2798427Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2798524Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2798870Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2798906Z graph_break [] 2025-12-04T09:42:50.2798978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2799111Z Compiled module path: 
/tmp/tmpiak5v0ji/4z/c4zqz2p3nu7ar3n2w4ozmsk7gtbhaesxtbang4ezw7yr2t3exgb3.py 2025-12-04T09:42:50.2799184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2799257Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2799316Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2799411Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2799757Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2799794Z graph_break [] 2025-12-04T09:42:50.2799867Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2800000Z Compiled module path: /tmp/tmputb90ews/ca/cca5udy25zmcg27vjmf36l42kvfmesf54ipj7i2begdb3lers2zw.py 2025-12-04T09:42:50.2800072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2800115Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2800169Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2800266Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2800624Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2800661Z graph_break [] 2025-12-04T09:42:50.2800734Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2800870Z Compiled module path: 
/tmp/tmpyz8phy1t/jn/cjnln5dtyols5eyqvhdeqcoweuntrzgbftw3xboyqn3izxynbtun.py 2025-12-04T09:42:50.2800942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2800985Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2801041Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2801139Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2801480Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2801519Z graph_break [] 2025-12-04T09:42:50.2801591Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2801727Z Compiled module path: /tmp/tmpbbugkt6v/3z/c3zgvjwsa76qdup6wal5bvf3ccdero3a2m5biswvmdy4zaciigj7.py 2025-12-04T09:42:50.2801797Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2801841Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2801906Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2802006Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2802350Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2802391Z graph_break [] 2025-12-04T09:42:50.2802463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2802599Z Compiled module path: 
/tmp/tmpb8t1ft7q/sv/csvnn6axisvhezh36eq2emnmssqzcqy6lwvcn6uxfmhxldxx6rku.py 2025-12-04T09:42:50.2802670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2802712Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2802766Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2802883Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2803228Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2803265Z graph_break [] 2025-12-04T09:42:50.2803337Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2803471Z Compiled module path: /tmp/tmpuake4uz1/is/cis2aq453bnnv4wrn5apdfizeefau53c7ytdpi7m4y7hvo6z3dck.py 2025-12-04T09:42:50.2803546Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2803587Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2803643Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2803742Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2804088Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2804137Z graph_break [] 2025-12-04T09:42:50.2804211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2804341Z Compiled module path: 
/tmp/tmp_5sp4e1b/nj/cnjjvf2hskj64abhf3tcw6wgkrtmet7smjrbe6zg4om5nzlujwzg.py 2025-12-04T09:42:50.2804415Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2804458Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2804513Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2804610Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2804958Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2804996Z graph_break [] 2025-12-04T09:42:50.2805069Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2805204Z Compiled module path: /tmp/tmpgrsri9x0/fw/cfwpxxh62dgwp44s2won22nlpb5vbdsypcxk22gyrjqo6hkmvro2.py 2025-12-04T09:42:50.2805281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2805321Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2805376Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2805472Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2805828Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2805867Z graph_break [] 2025-12-04T09:42:50.2805980Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2806117Z Compiled module path: 
/tmp/tmpfphthme4/y3/cy3lbra7px6alzpx2jj7ub7lvbinwfejqdevqsgiwnipuox7vje2.py 2025-12-04T09:42:50.2806189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2806230Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2806284Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2806383Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2806770Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2806809Z graph_break [] 2025-12-04T09:42:50.2806883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2807014Z Compiled module path: /tmp/tmp_mo_qyhr/cd/ccdskbg4cigl5o4msvv6w22so7lwnpns4qllesqanqxbqio3g2to.py 2025-12-04T09:42:50.2807085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2807127Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2807181Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2807280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2807625Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2807663Z graph_break [] 2025-12-04T09:42:50.2807734Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2807891Z Compiled module path: 
/tmp/tmp3c0whvmo/fp/cfpnkdotiotnfh3ufogublj64f5pyhjwrbjkdv2so3lkiiycnppv.py
2025-12-04T09:42:50.2807974Z _________________ TestKernelBenchmark.test_pw_kernel_benchmark _________________
2025-12-04T09:42:50.2808022Z Traceback (most recent call last):
2025-12-04T09:42:50.2808167Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 145, in test_pw_kernel_benchmark
2025-12-04T09:42:50.2808213Z self.verify_compiled_kernels()
2025-12-04T09:42:50.2808354Z File "/var/lib/jenkins/pytorch/test/inductor/test_kernel_benchmark.py", line 78, in verify_compiled_kernels
2025-12-04T09:42:50.2808396Z ).run(bench_out)
2025-12-04T09:42:50.2808464Z RuntimeError: Expected to not find "GB/s" but found it
2025-12-04T09:42:50.2808507Z None UNK cufi44wmvf
2025-12-04T09:42:50.2808640Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2808787Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
2025-12-04T09:42:50.2808834Z ~~~~ <--- HERE
2025-12-04T09:42:50.2808975Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
2025-12-04T09:42:50.2809015Z From CHECK-NOT: GB/s
2025-12-04T09:42:50.2809016Z
2025-12-04T09:42:50.2809018Z
2025-12-04T09:42:50.2809114Z To execute this test, run the following from the base repo dir:
2025-12-04T09:42:50.2809264Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark
2025-12-04T09:42:50.2809266Z
2025-12-04T09:42:50.2809355Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:42:50.2809431Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:42:50.2809473Z frames [('total', 1), ('ok', 1)]
2025-12-04T09:42:50.2809530Z stats
[('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2809631Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2809978Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2810016Z graph_break [] 2025-12-04T09:42:50.2810108Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2810246Z Compiled module path: /tmp/tmpjyagpq6g/qq/cqqqam77cvsbnzeardogw6uildbimud2qhthzmcvylla5v2wpnry.py 2025-12-04T09:42:50.2810320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2810362Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2810419Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2810515Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2810862Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2810902Z graph_break [] 2025-12-04T09:42:50.2810975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2811115Z Compiled module path: /tmp/tmpi9yru0cs/bk/cbkr7pexvzj2w7gcqtgcxofgao3h5fnvojawmlabpbm32bk5yut7.py 2025-12-04T09:42:50.2811187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2811247Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2811302Z stats [('calls_captured', 3), ('unique_graphs', 1)] 
2025-12-04T09:42:50.2811401Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2811741Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2811780Z graph_break [] 2025-12-04T09:42:50.2811853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2811990Z Compiled module path: /tmp/tmpi6bzr380/q4/cq46oeupozwvr2vbym67psnom5cy62y2xctngqkpqou2n6hfqaej.py 2025-12-04T09:42:50.2812060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2812102Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2812157Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2812254Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2812594Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2812630Z graph_break [] 2025-12-04T09:42:50.2812702Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2812858Z Compiled module path: /tmp/tmphumg5l9r/id/cidkj46tlt2zoao3gvqjm4qsyxezlqnjgalgiw27hn7bfesybjvn.py 2025-12-04T09:42:50.2812929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2812971Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2813027Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2813127Z aot_autograd 
[('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2813469Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2813506Z graph_break [] 2025-12-04T09:42:50.2813580Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2813720Z Compiled module path: /tmp/tmpxkct2flv/ef/cefq7oabaz3whwsgknvylnetmndhnzeyjcrkapds42khdsgchsui.py 2025-12-04T09:42:50.2813814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2813856Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2813914Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2814012Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2814356Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2814391Z graph_break [] 2025-12-04T09:42:50.2814465Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2814600Z Compiled module path: /tmp/tmpgg2qm8mx/r6/cr6ibyho2xquvykv7uk5pzeidgofvcil6wfdtzbieqqvtnq5cefy.py 2025-12-04T09:42:50.2814676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2814718Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2814774Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2814885Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2815232Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2815267Z graph_break [] 2025-12-04T09:42:50.2815341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2815474Z Compiled module path: /tmp/tmpch7cg2js/ss/cssga7ie465b44tv64gompjrhfg6vlr37z56ydfp3r72mtzjdshy.py 2025-12-04T09:42:50.2815549Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2815593Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2815648Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2815744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2816192Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2816229Z graph_break [] 2025-12-04T09:42:50.2816303Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2816436Z Compiled module path: /tmp/tmpoplidnd6/63/c63vzedrpz5mff44izoa3tc3ydaix2v7kwuivdyx6yux7vqmxywb.py 2025-12-04T09:42:50.2816523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2816567Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2816620Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2816719Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T09:42:50.2817063Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2817101Z graph_break [] 2025-12-04T09:42:50.2817173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2819383Z Compiled module path: /tmp/tmpdfgeo7w5/tf/ctftrzrc4o733kz4zxfx6rb5d4oun42z3cklw5hwfbjinkfvtqnd.py 2025-12-04T09:42:50.2819466Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2819511Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2819609Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2819715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2820060Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2820100Z graph_break [] 2025-12-04T09:42:50.2820172Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2820308Z Compiled module path: /tmp/tmpn89q44id/3i/c3icbyr3w4k67fup4sqnkqcx543nglp2g6ghaiasywz7w2wwgypf.py 2025-12-04T09:42:50.2820378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2820422Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2820476Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2820580Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2820923Z inductor 
[('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2820977Z graph_break [] 2025-12-04T09:42:50.2821048Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2821186Z Compiled module path: /tmp/tmpcstih6z5/ph/cphxp4lw2y3hfokgonh365plllew5yywsuwqq6equcarnjyytbz4.py 2025-12-04T09:42:50.2821257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2821298Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2821353Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2821454Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2821801Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2821838Z graph_break [] 2025-12-04T09:42:50.2821912Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2822046Z Compiled module path: /tmp/tmptvj3ne19/sq/csqekonyxqg6we5zmnkp6c3rs5kmxqrgmjxzik7rra566pmq4gf4.py 2025-12-04T09:42:50.2822118Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2822157Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2822214Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2822332Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2822676Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2822715Z graph_break [] 2025-12-04T09:42:50.2822788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2822917Z Compiled module path: /tmp/tmp_gji2v_8/44/c44zgbr6cfmb7xytscj7r7grqjj7kedjw4trgxgavre4je4lr6mf.py 2025-12-04T09:42:50.2822989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2823030Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2823086Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2823183Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2823549Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2823586Z graph_break [] 2025-12-04T09:42:50.2823658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2823797Z Compiled module path: /tmp/tmprfjqjg3z/zm/czmug32tjsebcnrvboxhvgooxffr3noi5zajeugehoh2cxgpgvj6.py 2025-12-04T09:42:50.2823870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2823910Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2823965Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2824062Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2824408Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2824458Z graph_break [] 2025-12-04T09:42:50.2824529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2824665Z Compiled module path: /tmp/tmp43b74yex/z4/cz4bmg7upkojifkvmaahok76rso5cimijchtsziolfxy2og2u3s6.py 2025-12-04T09:42:50.2824736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2824778Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2824831Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2824927Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2825272Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2825311Z graph_break [] 2025-12-04T09:42:50.2825383Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2825520Z Compiled module path: /tmp/tmpzvngxad2/s2/cs2ymnzlbv2mc6aqnuoopjwsdmnlit2ufxkuhjfgi5tx2ebxz7nc.py 2025-12-04T09:42:50.2825592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2825633Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2825686Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2825782Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2826193Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2826230Z graph_break [] 2025-12-04T09:42:50.2826302Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2826437Z Compiled module path: /tmp/tmpog756mu8/gc/cgcpnuexvs5e6m6gdj73dx4n2q7uqijuuq3ubaujfvfcwinnsb33.py 2025-12-04T09:42:50.2826508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2826549Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2826603Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2826702Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2827073Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2827112Z graph_break [] 2025-12-04T09:42:50.2827183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2827314Z Compiled module path: /tmp/tmp_dv5jxa9/6k/c6ka75uhdnm6eucdbl66kr6ex2gra4r3jkpowp4fulvs7mx3ypml.py 2025-12-04T09:42:50.2827386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2827426Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2827482Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2827577Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2827920Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2827958Z graph_break [] 2025-12-04T09:42:50.2828032Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2828193Z Compiled module path: /tmp/tmpgsy0s3rp/yf/cyf5fuhpzb2ye7v2a57nsttrv7dehivwfiyc6ipcr734hikanz2w.py 2025-12-04T09:42:50.2828266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2828306Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2828360Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2828457Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2828804Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2828840Z graph_break [] 2025-12-04T09:42:50.2828913Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2829045Z Compiled module path: /tmp/tmp_hj1kmqe/qt/cqtjfgww4epqrnqrizrwk4djvl5xpnfncvqc2r3qdxnxtgqxpkcl.py 2025-12-04T09:42:50.2829119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2829158Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2829213Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2829307Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2829651Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2829701Z graph_break [] 2025-12-04T09:42:50.2829773Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2829907Z Compiled module path: /tmp/tmpxlcjsc05/ko/ckoxwsmt4vqls6z6tx63k3uruzrpei5xkiuu24xjdowqcqqvmgyp.py 2025-12-04T09:42:50.2829984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2830024Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2830080Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2830177Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2830523Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2830560Z graph_break [] 2025-12-04T09:42:50.2830654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2830791Z Compiled module path: /tmp/tmpfwo2vg1z/r6/cr65mqq2yuhrqjtrnsbv5n2brgxyknu52tpxcuaubovpaebdxx74.py 2025-12-04T09:42:50.2830864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2830906Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2830959Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2831057Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2831399Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2831439Z graph_break [] 2025-12-04T09:42:50.2831512Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2831648Z Compiled module path: /tmp/tmpffjovlv2/on/convo7nb66xejt2vizpg5taq2hzkarva6kvlpikhakxj45gv5juf.py 2025-12-04T09:42:50.2831719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2831771Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2831825Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2831922Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2832265Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2832302Z graph_break [] 2025-12-04T09:42:50.2832375Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2832512Z Compiled module path: /tmp/tmpxwfglx8h/rp/crpmw273fvhq24tatgvd734xavjhkrdt6cmqwg2hmtkpxxexuppr.py 2025-12-04T09:42:50.2832583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2832626Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2832680Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2832778Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2833122Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2833157Z graph_break [] 2025-12-04T09:42:50.2833230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2833377Z Compiled module path: /tmp/tmp5psoaoef/dr/cdry3zflyho3wrc4yjjs7oro5tq4a7dzgs5va7zzcxb5de55fcz7.py 2025-12-04T09:42:50.2833449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2833489Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2833546Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2833643Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2833985Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2834020Z graph_break [] 2025-12-04T09:42:50.2834092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2834246Z Compiled module path: /tmp/tmpeeyfjgtm/dv/cdve6mxy7mo7r2ih3h4nwpgl5noavz33xgl63f7jsuo5j2yu5kub.py 2025-12-04T09:42:50.2834320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2834361Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2834416Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2834512Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2834858Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2834894Z graph_break [] 2025-12-04T09:42:50.2834967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2835105Z Compiled module path: /tmp/tmp22kfpiuu/zv/czvhfpdlijdncjn67yvczjfyiptrzzne2eiok46ned7k4tdptdzi.py 2025-12-04T09:42:50.2835177Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2835218Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2835272Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2835381Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2835723Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2835760Z graph_break [] 2025-12-04T09:42:50.2835831Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2836010Z Compiled module path: /tmp/tmpc1ytlvgs/5h/c5hn5kemane2up7uqe2zxrbxljjqlcs6xs5pq534ggoxivqtl7lo.py 2025-12-04T09:42:50.2836084Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2836125Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2836179Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2836276Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2836620Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2836657Z graph_break [] 2025-12-04T09:42:50.2836729Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2836873Z Compiled module path: /tmp/tmppmrqdeoa/q3/cq3a3kewslexzbdfbcrw5vv2uawnhkcmfdmbivorym7hva7r73e2.py 2025-12-04T09:42:50.2836962Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2837006Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2837059Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2837157Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2837504Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2837544Z graph_break [] 2025-12-04T09:42:50.2837614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2837749Z Compiled module path: /tmp/tmpiak5v0ji/4z/c4zqz2p3nu7ar3n2w4ozmsk7gtbhaesxtbang4ezw7yr2t3exgb3.py 2025-12-04T09:42:50.2837821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2837889Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2837943Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2838040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2838383Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2838421Z graph_break [] 2025-12-04T09:42:50.2838496Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2838633Z Compiled module path: /tmp/tmputb90ews/ca/cca5udy25zmcg27vjmf36l42kvfmesf54ipj7i2begdb3lers2zw.py 2025-12-04T09:42:50.2838709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2838750Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2838807Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2838903Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2839249Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2839301Z graph_break [] 2025-12-04T09:42:50.2839373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2839510Z Compiled module path: /tmp/tmpyz8phy1t/jn/cjnln5dtyols5eyqvhdeqcoweuntrzgbftw3xboyqn3izxynbtun.py 2025-12-04T09:42:50.2839581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2839623Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2839678Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2839774Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2840118Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2840157Z graph_break [] 2025-12-04T09:42:50.2840230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2840363Z Compiled module path: /tmp/tmpbbugkt6v/3z/c3zgvjwsa76qdup6wal5bvf3ccdero3a2m5biswvmdy4zaciigj7.py 2025-12-04T09:42:50.2840434Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2840474Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2840540Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2840639Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2840985Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2841021Z graph_break [] 2025-12-04T09:42:50.2841094Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2841228Z Compiled module path: /tmp/tmpb8t1ft7q/sv/csvnn6axisvhezh36eq2emnmssqzcqy6lwvcn6uxfmhxldxx6rku.py 2025-12-04T09:42:50.2841300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2841344Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2841397Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2841522Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2841866Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2841903Z graph_break [] 2025-12-04T09:42:50.2841974Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2842108Z Compiled module path: /tmp/tmpuake4uz1/is/cis2aq453bnnv4wrn5apdfizeefau53c7ytdpi7m4y7hvo6z3dck.py 2025-12-04T09:42:50.2842180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2842221Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2842274Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2842371Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2842714Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2842764Z graph_break [] 2025-12-04T09:42:50.2842835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2842965Z Compiled module path: /tmp/tmp_5sp4e1b/nj/cnjjvf2hskj64abhf3tcw6wgkrtmet7smjrbe6zg4om5nzlujwzg.py 2025-12-04T09:42:50.2843036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2843077Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2843131Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2843227Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2843571Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2843608Z graph_break [] 2025-12-04T09:42:50.2843679Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2843814Z Compiled module path: /tmp/tmpgrsri9x0/fw/cfwpxxh62dgwp44s2won22nlpb5vbdsypcxk22gyrjqo6hkmvro2.py 2025-12-04T09:42:50.2843885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2843926Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2843979Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2844078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2844434Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2844471Z graph_break [] 2025-12-04T09:42:50.2844542Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2844677Z Compiled module path: /tmp/tmpfphthme4/y3/cy3lbra7px6alzpx2jj7ub7lvbinwfejqdevqsgiwnipuox7vje2.py 2025-12-04T09:42:50.2844749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2844789Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2844844Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2844939Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2845304Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2845341Z graph_break [] 2025-12-04T09:42:50.2845414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2845543Z Compiled module path: /tmp/tmp_mo_qyhr/cd/ccdskbg4cigl5o4msvv6w22so7lwnpns4qllesqanqxbqio3g2to.py 2025-12-04T09:42:50.2845615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2845655Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2845710Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2845804Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2846302Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2846338Z graph_break [] 2025-12-04T09:42:50.2846410Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2846565Z Compiled module path: /tmp/tmp3c0whvmo/fp/cfpnkdotiotnfh3ufogublj64f5pyhjwrbjkdv2so3lkiiycnppv.py 2025-12-04T09:42:50.2846637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2846677Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2846731Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2846826Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2847167Z inductor [('triton_bundler_save_kernel', 24), 
('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2847204Z graph_break [] 2025-12-04T09:42:50.2847275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2847411Z Compiled module path: /tmp/tmpwlopd1b0/vq/cvqmf5wtw6nxqfkmp6rrehbxz76b2iwdfgo2zjl4frnbs7vpscm5.py 2025-12-04T09:42:50.2847482Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:42:50.2847523Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:42:50.2847576Z stats [('calls_captured', 3), ('unique_graphs', 1)] 2025-12-04T09:42:50.2847673Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:42:50.2848016Z inductor [('triton_bundler_save_kernel', 24), ('benchmarking.InductorBenchmarker.benchmark', 3), ('benchmarking.InductorBenchmarker.benchmark_gpu', 3), ('async_compile_cache_miss', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:42:50.2848066Z graph_break [] 2025-12-04T09:42:50.2848138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:42:50.2848274Z Compiled module path: /tmp/tmp461n2xby/rc/crcl3yt4wc7xaogfaazikl5mpwhdgimsdmbuztcgbekusgsocl43.py 2025-12-04T09:42:50.2848509Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_kernel_benchmark/inductor.test_kernel_benchmark-05a8c0c9d49884d6.xml - 2025-12-04T09:42:50.2848570Z =========================== short test summary info ============================ 2025-12-04T09:42:50.2848791Z FAILED [5.1842s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 
2025-12-04T09:42:50.2848835Z None UNK cufi44wmvf 2025-12-04T09:42:50.2848969Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2849140Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2849187Z ~~~~ <--- HERE 2025-12-04T09:42:50.2849328Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2849368Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2849370Z 2025-12-04T09:42:50.2849372Z 2025-12-04T09:42:50.2849447Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2849598Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2849600Z 2025-12-04T09:42:50.2849689Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2849888Z FAILED [4.8054s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2849930Z None UNK cufi44wmvf 2025-12-04T09:42:50.2850059Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2850211Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2850257Z ~~~~ <--- HERE 2025-12-04T09:42:50.2850396Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2850434Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2850437Z 2025-12-04T09:42:50.2850439Z 2025-12-04T09:42:50.2850513Z To execute this test, 
run the following from the base repo dir: 2025-12-04T09:42:50.2850660Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2850662Z 2025-12-04T09:42:50.2850749Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2850943Z FAILED [5.0875s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2850985Z None UNK cufi44wmvf 2025-12-04T09:42:50.2851111Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2851249Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2851303Z ~~~~ <--- HERE 2025-12-04T09:42:50.2851443Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2851481Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2851485Z 2025-12-04T09:42:50.2851486Z 2025-12-04T09:42:50.2851560Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2851704Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2851707Z 2025-12-04T09:42:50.2851793Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2851986Z FAILED [5.2671s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2852028Z None UNK cufi44wmvf 2025-12-04T09:42:50.2852180Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2852319Z 0.009ms 
0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2852364Z ~~~~ <--- HERE 2025-12-04T09:42:50.2852502Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2852540Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2852542Z 2025-12-04T09:42:50.2852544Z 2025-12-04T09:42:50.2852616Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2852759Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2852762Z 2025-12-04T09:42:50.2852846Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2853039Z FAILED [4.8273s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2853080Z None UNK cufi44wmvf 2025-12-04T09:42:50.2853220Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2853358Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2853401Z ~~~~ <--- HERE 2025-12-04T09:42:50.2853538Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2853576Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2853579Z 2025-12-04T09:42:50.2853581Z 2025-12-04T09:42:50.2853655Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2853799Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2853801Z 
2025-12-04T09:42:50.2853886Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2854077Z FAILED [4.8356s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2854119Z None UNK cufi44wmvf 2025-12-04T09:42:50.2854243Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2854380Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2854432Z ~~~~ <--- HERE 2025-12-04T09:42:50.2854572Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2854610Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2854612Z 2025-12-04T09:42:50.2854614Z 2025-12-04T09:42:50.2854687Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2854831Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2854833Z 2025-12-04T09:42:50.2854917Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2855109Z FAILED [4.9103s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2855150Z None UNK cufi44wmvf 2025-12-04T09:42:50.2855316Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2855453Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2855498Z ~~~~ <--- HERE 2025-12-04T09:42:50.2855635Z 0.006ms 
0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2855674Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2855676Z 2025-12-04T09:42:50.2855678Z 2025-12-04T09:42:50.2855750Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2855897Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2855900Z 2025-12-04T09:42:50.2856033Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2856224Z FAILED [4.8543s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2856288Z None UNK cufi44wmvf 2025-12-04T09:42:50.2856413Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2856550Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2856592Z ~~~~ <--- HERE 2025-12-04T09:42:50.2856729Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2856767Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2856769Z 2025-12-04T09:42:50.2856771Z 2025-12-04T09:42:50.2856847Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2856989Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2856992Z 2025-12-04T09:42:50.2857078Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2857270Z FAILED [4.9731s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - 
RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2857310Z None UNK cufi44wmvf 2025-12-04T09:42:50.2857437Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2857576Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2857635Z ~~~~ <--- HERE 2025-12-04T09:42:50.2857777Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2857817Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2857819Z 2025-12-04T09:42:50.2857821Z 2025-12-04T09:42:50.2857892Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2858038Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2858040Z 2025-12-04T09:42:50.2858123Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2858316Z FAILED [4.9436s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2858358Z None UNK cufi44wmvf 2025-12-04T09:42:50.2858512Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2858650Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2858698Z ~~~~ <--- HERE 2025-12-04T09:42:50.2858836Z 0.010ms 0.000 GB 0.00GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2858875Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2858876Z 
2025-12-04T09:42:50.2858878Z 2025-12-04T09:42:50.2858951Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2859093Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2859096Z 2025-12-04T09:42:50.2859185Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2859376Z FAILED [5.2033s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2859432Z None UNK cufi44wmvf 2025-12-04T09:42:50.2859557Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2859695Z 0.009ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2859740Z ~~~~ <--- HERE 2025-12-04T09:42:50.2859880Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2859919Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2859921Z 2025-12-04T09:42:50.2859924Z 2025-12-04T09:42:50.2859998Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2860140Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2860145Z 2025-12-04T09:42:50.2860227Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2860419Z FAILED [4.9421s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2860459Z None UNK cufi44wmvf 2025-12-04T09:42:50.2860584Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, 
num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2860724Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2860780Z ~~~~ <--- HERE 2025-12-04T09:42:50.2860921Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2860960Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2860962Z 2025-12-04T09:42:50.2860964Z 2025-12-04T09:42:50.2861035Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2861179Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2861181Z 2025-12-04T09:42:50.2861264Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2861461Z FAILED [5.0374s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2861502Z None UNK cufi44wmvf 2025-12-04T09:42:50.2861648Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2861787Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2861831Z ~~~~ <--- HERE 2025-12-04T09:42:50.2861970Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2862007Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2862009Z 2025-12-04T09:42:50.2862010Z 2025-12-04T09:42:50.2862084Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2862228Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py 
TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2862230Z 2025-12-04T09:42:50.2862317Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2862508Z FAILED [4.9823s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2862562Z None UNK cufi44wmvf 2025-12-04T09:42:50.2862686Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2862824Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2862866Z ~~~~ <--- HERE 2025-12-04T09:42:50.2863003Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2863044Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2863047Z 2025-12-04T09:42:50.2863049Z 2025-12-04T09:42:50.2863121Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2863265Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2863268Z 2025-12-04T09:42:50.2863351Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2863544Z FAILED [5.1733s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2863585Z None UNK cufi44wmvf 2025-12-04T09:42:50.2863710Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2863860Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 
2025-12-04T09:42:50.2863906Z ~~~~ <--- HERE 2025-12-04T09:42:50.2864046Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2864086Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2864087Z 2025-12-04T09:42:50.2864089Z 2025-12-04T09:42:50.2864160Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2864305Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2864308Z 2025-12-04T09:42:50.2864390Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2864586Z FAILED [4.9936s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2864650Z None UNK cufi44wmvf 2025-12-04T09:42:50.2864774Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2864914Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2864957Z ~~~~ <--- HERE 2025-12-04T09:42:50.2865098Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2865136Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2865138Z 2025-12-04T09:42:50.2865140Z 2025-12-04T09:42:50.2865212Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2865357Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2865360Z 2025-12-04T09:42:50.2865446Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2865639Z FAILED [5.2223s] 
inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2865696Z None UNK cufi44wmvf 2025-12-04T09:42:50.2865820Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2866002Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2866045Z ~~~~ <--- HERE 2025-12-04T09:42:50.2866182Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2866224Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2866226Z 2025-12-04T09:42:50.2866228Z 2025-12-04T09:42:50.2866300Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2866444Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2866446Z 2025-12-04T09:42:50.2866528Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2866719Z FAILED [4.7890s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2866759Z None UNK cufi44wmvf 2025-12-04T09:42:50.2866884Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2867049Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2867093Z ~~~~ <--- HERE 2025-12-04T09:42:50.2867231Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 
2025-12-04T09:42:50.2867270Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2867272Z 2025-12-04T09:42:50.2867273Z 2025-12-04T09:42:50.2867345Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2867488Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2867491Z 2025-12-04T09:42:50.2867574Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2867765Z FAILED [4.8791s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2867832Z None UNK cufi44wmvf 2025-12-04T09:42:50.2867957Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2868097Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2868140Z ~~~~ <--- HERE 2025-12-04T09:42:50.2868276Z 0.008ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2868314Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2868315Z 2025-12-04T09:42:50.2868317Z 2025-12-04T09:42:50.2868389Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2868534Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2868536Z 2025-12-04T09:42:50.2868622Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2868813Z FAILED [5.0239s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2868873Z None UNK cufi44wmvf 2025-12-04T09:42:50.2868998Z 
0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2869137Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2869180Z ~~~~ <--- HERE 2025-12-04T09:42:50.2869321Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2869359Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2869360Z 2025-12-04T09:42:50.2869362Z 2025-12-04T09:42:50.2869434Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2869578Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2869580Z 2025-12-04T09:42:50.2869663Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2869856Z FAILED [4.9146s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2869897Z None UNK cufi44wmvf 2025-12-04T09:42:50.2870022Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2870174Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2870217Z ~~~~ <--- HERE 2025-12-04T09:42:50.2870355Z 0.007ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2870393Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2870395Z 2025-12-04T09:42:50.2870396Z 2025-12-04T09:42:50.2870468Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2870610Z 
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2870612Z 2025-12-04T09:42:50.2870695Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2870907Z FAILED [5.1448s] inductor/test_kernel_benchmark.py::TestKernelBenchmark::test_pw_kernel_benchmark - RuntimeError: Expected to not find "GB/s" but found it 2025-12-04T09:42:50.2870951Z None UNK cufi44wmvf 2025-12-04T09:42:50.2871077Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2871218Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 2, num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None 2025-12-04T09:42:50.2871260Z ~~~~ <--- HERE 2025-12-04T09:42:50.2871398Z 0.006ms 0.000 GB 0.01GB/s 23 regs 0 spills 0 shared mem @ XBLOCK: 8, waves_per_eu: 1, num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None 2025-12-04T09:42:50.2871436Z From CHECK-NOT: GB/s 2025-12-04T09:42:50.2871438Z 2025-12-04T09:42:50.2871439Z 2025-12-04T09:42:50.2871512Z To execute this test, run the following from the base repo dir: 2025-12-04T09:42:50.2871657Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_kernel_benchmark.py TestKernelBenchmark.test_pw_kernel_benchmark 2025-12-04T09:42:50.2871659Z 2025-12-04T09:42:50.2871744Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:42:50.2871826Z ================== 22 failed, 78 passed in 561.53s (0:09:21) =================== 2025-12-04T09:42:50.2871830Z 2025-12-04T09:42:50.2872011Z FINISHED PRINTING LOG FILE of inductor/test_kernel_benchmark 1/1 (test/test-reports/inductor.test_kernel_benchmark_1.1_80b5ed88f0cc4a76_.log) 2025-12-04T09:42:50.2872014Z 2025-12-04T09:42:50.2872137Z Finished inductor/test_kernel_benchmark 1/1 ... 
[2025-12-04 09:42:50.185736][5635390.692139161], took 9.49min 2025-12-04T09:42:50.2872375Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:42:50.2872489Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:42:50.2872587Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T09:42:50.2872637Z Uploading artifacts took 0.00 seconds 2025-12-04T09:42:50.2872688Z inductor/test_kernel_benchmark 1/1 failed! 2025-12-04T09:42:50.2872806Z Running inductor/test_torchinductor_opinfo 1/12 ... [2025-12-04 09:42:50.280310][5635390.78670855] 2025-12-04T09:42:50.2872854Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:42:50.2873236Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '--shard-id=1', '--num-shards=12', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:42:50.280635] 2025-12-04T09:43:00.7839061Z 2025-12-04T09:43:00.7840462Z inductor/test_torchinductor_opinfo 1/12 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_1.12_bbf359353e18e60b_.log 2025-12-04T09:43:00.7868044Z Running 50 items in this shard: test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32 2025-12-04T09:43:00.7897407Z 2025-12-04T09:43:00.7897852Z Finished inductor/test_torchinductor_opinfo 1/12 ... 
[2025-12-04 09:43:00.783881][5635401.290279974], took 0.18min 2025-12-04T09:43:00.7899210Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:43:00.8760319Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:43:00.8778210Z Running inductor/test_torchinductor_opinfo 7/12 ... [2025-12-04 09:43:00.877557][5635401.383954236] 2025-12-04T09:43:00.8778935Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:43:00.8784371Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '--shard-id=7', '--num-shards=12', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:43:00.877865] 2025-12-04T09:44:38.1181091Z 2025-12-04T09:44:38.1182497Z inductor/test_torchinductor_opinfo 7/12 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_7.12_3abb3d807950d344_.log 2025-12-04T09:44:38.1236907Z Running 100 items in this shard: test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, 
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32 2025-12-04T09:44:38.1289135Z 2025-12-04T09:44:38.1289579Z Finished inductor/test_torchinductor_opinfo 7/12 ... [2025-12-04 09:44:38.117643][5635498.624046702], took 1.62min 2025-12-04T09:44:38.1290933Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:44:38.2092940Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:44:38.2107108Z Running inductor/test_pattern_matcher 1/1 ... 
[2025-12-04 09:44:38.210422][5635498.716820717]
2025-12-04T09:44:38.2107785Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set
2025-12-04T09:44:38.2110839Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_pattern_matcher.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:44:38.210763]
2025-12-04T09:54:17.2636184Z
2025-12-04T09:54:17.2637489Z PRINTING LOG FILE of inductor/test_pattern_matcher 1/1 (test/test-reports/inductor.test_pattern_matcher_1.1_8672400c1baf9dfa_.log)
2025-12-04T09:54:17.2638858Z Test results will be stored in test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-9f787b25300815d0.xml
2025-12-04T09:54:17.2639815Z ============================= test session starts ==============================
2025-12-04T09:54:17.2640545Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python
2025-12-04T09:54:17.2641171Z cachedir: .pytest_cache
2025-12-04T09:54:17.2641926Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow]
2025-12-04T09:54:17.2642727Z rootdir: /var/lib/jenkins/pytorch
2025-12-04T09:54:17.2643116Z configfile: pytest.ini
2025-12-04T09:54:17.2643895Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0
2025-12-04T09:54:17.2645371Z collecting ... 
collected 52 items 2025-12-04T09:54:17.2645835Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:54:17.2775244Z Running 250 items in this shard: test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, 
test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes, test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes 2025-12-04T09:54:17.2870672Z 2025-12-04T09:54:17.2871072Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [6.2110s] [ 0%] 2025-12-04T09:54:17.2871993Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [2.2780s] [ 0%] 2025-12-04T09:54:17.2872936Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.8221s] [ 1%] 2025-12-04T09:54:17.2874374Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:44:55.224000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2876418Z E1204 09:44:55.224000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2877997Z E1204 09:44:55.224000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2879017Z E1204 09:44:55.262000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2880575Z E1204 09:44:55.262000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.2882095Z E1204 09:44:55.262000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2883099Z E1204 09:44:56.911000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2884660Z E1204 09:44:56.911000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2886197Z E1204 09:44:56.911000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2887188Z E1204 09:44:56.913000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2888736Z E1204 09:44:56.913000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2890280Z E1204 09:44:56.913000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2891272Z E1204 09:44:56.933000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2892816Z E1204 09:44:56.933000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2894312Z E1204 09:44:56.933000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.2895305Z E1204 09:44:56.959000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2896993Z E1204 09:44:56.959000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2898478Z E1204 09:44:56.959000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2899473Z E1204 09:44:56.982000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2901039Z E1204 09:44:56.982000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2902525Z E1204 09:44:56.982000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2903517Z E1204 09:44:56.984000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2905064Z E1204 09:44:56.984000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2906658Z E1204 09:44:56.984000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2907653Z E1204 09:44:56.985000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2909201Z E1204 09:44:56.985000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2910690Z E1204 09:44:56.985000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2911694Z E1204 09:44:56.998000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2913243Z E1204 09:44:56.998000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2914736Z E1204 09:44:56.998000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2915732Z E1204 09:44:56.999000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2917328Z E1204 09:44:56.999000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2918871Z E1204 09:44:56.999000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2919863Z E1204 09:44:57.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2921405Z E1204 09:44:57.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2922891Z E1204 09:44:57.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.2923883Z E1204 09:44:57.024000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2925439Z E1204 09:44:57.024000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2927120Z E1204 09:44:57.024000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2928114Z E1204 09:44:59.041000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2929660Z E1204 09:44:59.041000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2931143Z E1204 09:44:59.041000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.2932143Z E1204 09:44:59.097000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.2933696Z E1204 09:44:59.097000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.2935191Z E1204 09:44:59.097000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.2935903Z PASSED [5.5753s] [ 1%] 2025-12-04T09:54:17.2937103Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.2938528Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.7136s] [ 2%] 2025-12-04T09:54:17.2939379Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8464s] [ 2%] 2025-12-04T09:54:17.2940219Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8820s] [ 2%] 2025-12-04T09:54:17.2941071Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9819s] [ 2%] 2025-12-04T09:54:17.2941920Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8707s] [ 2%] 2025-12-04T09:54:17.2942766Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.3036s] [ 2%] 2025-12-04T09:54:17.2943627Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9970s] [ 2%] 2025-12-04T09:54:17.2944467Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9507s] [ 2%] 2025-12-04T09:54:17.2945324Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9031s] [ 2%] 2025-12-04T09:54:17.2946228Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8283s] [ 2%] 2025-12-04T09:54:17.2947064Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8722s] [ 2%] 2025-12-04T09:54:17.2947958Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.2820s] [ 2%] 2025-12-04T09:54:17.2948800Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.0018s] [ 2%] 2025-12-04T09:54:17.2949632Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8610s] [ 2%] 
2025-12-04T09:54:17.2950482Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9924s] [ 2%] 2025-12-04T09:54:17.2951312Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8316s] [ 2%] 2025-12-04T09:54:17.2952156Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9305s] [ 2%] 2025-12-04T09:54:17.2952996Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.3296s] [ 2%] 2025-12-04T09:54:17.2953832Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9649s] [ 2%] 2025-12-04T09:54:17.2954677Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8274s] [ 2%] 2025-12-04T09:54:17.2955620Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8396s] [ 2%] 2025-12-04T09:54:17.2956504Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8762s] [ 2%] 2025-12-04T09:54:17.2957346Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm FAILED [0.5672s] [ 2%] 2025-12-04T09:54:17.2958194Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.7975s] [ 2%] 2025-12-04T09:54:17.2959036Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.2097s] [ 2%] 2025-12-04T09:54:17.2959878Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8223s] [ 2%] 2025-12-04T09:54:17.2960723Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8259s] [ 2%] 2025-12-04T09:54:17.2961565Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.7922s] [ 2%] 2025-12-04T09:54:17.2962415Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8713s] [ 2%] 2025-12-04T09:54:17.2963265Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8084s] [ 2%] 2025-12-04T09:54:17.2964105Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.1157s] [ 2%] 2025-12-04T09:54:17.2964985Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8114s] [ 2%] 2025-12-04T09:54:17.2965827Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.7966s] [ 2%] 2025-12-04T09:54:17.2966726Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9068s] [ 2%] 2025-12-04T09:54:17.2967567Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9004s] [ 2%] 2025-12-04T09:54:17.2968407Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.0842s] [ 2%] 2025-12-04T09:54:17.2969247Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.4869s] [ 2%] 2025-12-04T09:54:17.2970101Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.0821s] [ 2%] 2025-12-04T09:54:17.2970940Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9002s] [ 2%] 2025-12-04T09:54:17.2971789Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.0960s] [ 2%] 2025-12-04T09:54:17.2972631Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.6998s] [ 2%] 2025-12-04T09:54:17.2973475Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.3054s] [ 2%] 2025-12-04T09:54:17.2974309Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.1318s] [ 2%] 2025-12-04T09:54:17.2975152Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.8489s] [ 2%] 2025-12-04T09:54:17.2976036Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9921s] [ 2%] 2025-12-04T09:54:17.2976919Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [2.5909s] [ 2%] 2025-12-04T09:54:17.2977763Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED 
[1.9443s] [ 2%] 2025-12-04T09:54:17.2978604Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.9503s] [ 2%] 2025-12-04T09:54:17.2979448Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm PASSED [1.8940s] [ 2%] 2025-12-04T09:54:17.2980343Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9222s] [ 2%] 2025-12-04T09:54:17.2981284Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [1.3284s] [ 2%] 2025-12-04T09:54:17.2982215Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9651s] [ 2%] 2025-12-04T09:54:17.2983145Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9722s] [ 2%] 2025-12-04T09:54:17.2984072Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9671s] [ 2%] 2025-12-04T09:54:17.2985076Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8580s] [ 2%] 2025-12-04T09:54:17.2986043Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8863s] [ 2%] 2025-12-04T09:54:17.2986977Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8928s] [ 2%] 2025-12-04T09:54:17.2987911Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9692s] [ 2%] 2025-12-04T09:54:17.2988831Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9236s] [ 2%] 2025-12-04T09:54:17.2989756Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9490s] [ 2%] 2025-12-04T09:54:17.2990685Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9120s] [ 2%] 2025-12-04T09:54:17.2991613Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9062s] [ 2%] 2025-12-04T09:54:17.2992545Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9617s] [ 2%] 2025-12-04T09:54:17.2993471Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [1.3668s] [ 2%] 2025-12-04T09:54:17.2994442Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8833s] [ 2%] 2025-12-04T09:54:17.2995362Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9029s] [ 2%] 2025-12-04T09:54:17.2996323Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8734s] [ 2%] 2025-12-04T09:54:17.2997249Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8840s] [ 2%] 2025-12-04T09:54:17.2998171Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8798s] [ 2%] 2025-12-04T09:54:17.2999102Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8911s] [ 2%] 2025-12-04T09:54:17.3000029Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8710s] [ 2%] 2025-12-04T09:54:17.3000959Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9365s] [ 2%] 2025-12-04T09:54:17.3001892Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9163s] [ 2%] 2025-12-04T09:54:17.3002823Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9086s] [ 2%] 2025-12-04T09:54:17.3003759Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9343s] [ 2%] 2025-12-04T09:54:17.3004688Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9383s] [ 2%] 2025-12-04T09:54:17.3005620Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [1.0638s] [ 2%] 2025-12-04T09:54:17.3006647Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [1.4438s] [ 2%] 2025-12-04T09:54:17.3007572Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9228s] [ 2%] 2025-12-04T09:54:17.3008495Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8951s] [ 2%] 2025-12-04T09:54:17.3009425Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9597s] [ 2%] 2025-12-04T09:54:17.3010355Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8970s] [ 2%] 2025-12-04T09:54:17.3011280Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9253s] [ 2%] 2025-12-04T09:54:17.3012203Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9077s] [ 2%] 2025-12-04T09:54:17.3013127Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8763s] [ 2%] 2025-12-04T09:54:17.3014132Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8979s] [ 2%] 2025-12-04T09:54:17.3015060Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9123s] [ 2%] 2025-12-04T09:54:17.3016039Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.8499s] [ 2%] 2025-12-04T09:54:17.3017012Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9172s] [ 2%] 2025-12-04T09:54:17.3017935Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9587s] [ 2%] 2025-12-04T09:54:17.3018857Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [1.4298s] [ 2%] 2025-12-04T09:54:17.3019779Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9700s] [ 2%] 2025-12-04T09:54:17.3020695Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9458s] [ 2%] 2025-12-04T09:54:17.3021620Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9513s] [ 2%] 2025-12-04T09:54:17.3022541Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9147s] [ 2%] 2025-12-04T09:54:17.3023508Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9237s] [ 2%] 2025-12-04T09:54:17.3024434Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9623s] [ 2%] 2025-12-04T09:54:17.3025358Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_bad_cases PASSED [0.9525s] [ 2%] 2025-12-04T09:54:17.3026327Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3957s] [ 2%] 2025-12-04T09:54:17.3027254Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.2629s] [ 2%] 2025-12-04T09:54:17.3028184Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3779s] [ 2%] 2025-12-04T09:54:17.3029113Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.8974s] [ 2%] 2025-12-04T09:54:17.3030039Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3763s] [ 2%] 2025-12-04T09:54:17.3030969Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3424s] [ 2%] 2025-12-04T09:54:17.3031893Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3157s] [ 2%] 2025-12-04T09:54:17.3032819Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4105s] [ 2%] 2025-12-04T09:54:17.3033750Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4627s] [ 2%] 2025-12-04T09:54:17.3034677Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4523s] [ 2%] 2025-12-04T09:54:17.3035669Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works FAILED [1.4161s] [ 2%] 2025-12-04T09:54:17.3036625Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3003s] [ 2%] 2025-12-04T09:54:17.3037547Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.7886s] [ 2%] 2025-12-04T09:54:17.3038481Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3953s] [ 2%] 2025-12-04T09:54:17.3039406Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.2753s] [ 2%] 2025-12-04T09:54:17.3040327Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4489s] [ 2%] 2025-12-04T09:54:17.3041247Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.2866s] [ 2%] 2025-12-04T09:54:17.3042163Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3131s] [ 2%] 2025-12-04T09:54:17.3043170Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.2853s] [ 2%] 2025-12-04T09:54:17.3044088Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3625s] [ 2%] 2025-12-04T09:54:17.3045006Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.8852s] [ 2%] 2025-12-04T09:54:17.3045976Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3474s] [ 2%] 2025-12-04T09:54:17.3046901Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4219s] [ 2%] 2025-12-04T09:54:17.3047822Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3501s] [ 2%] 2025-12-04T09:54:17.3048746Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3987s] [ 2%] 2025-12-04T09:54:17.3049668Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3754s] [ 2%] 2025-12-04T09:54:17.3050604Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.5307s] [ 2%] 2025-12-04T09:54:17.3051520Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4740s] [ 2%] 2025-12-04T09:54:17.3052503Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.9475s] [ 2%] 2025-12-04T09:54:17.3053420Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.5837s] [ 2%] 2025-12-04T09:54:17.3054337Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3724s] [ 2%] 2025-12-04T09:54:17.3055256Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.3965s] [ 2%] 2025-12-04T09:54:17.3056249Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4121s] [ 2%] 2025-12-04T09:54:17.3057170Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.5363s] [ 2%] 2025-12-04T09:54:17.3058099Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4227s] [ 2%] 2025-12-04T09:54:17.3058705Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4143s] [ 2%] 2025-12-04T09:54:17.3059286Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4299s] [ 2%] 2025-12-04T09:54:17.3059865Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.9888s] [ 2%] 2025-12-04T09:54:17.3060448Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4297s] [ 2%] 2025-12-04T09:54:17.3061023Z 
inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works FAILED [0.9712s] [ 2%] 2025-12-04T09:54:17.3061601Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4677s] [ 2%] 2025-12-04T09:54:17.3062216Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.6099s] [ 2%] 2025-12-04T09:54:17.3062800Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.9833s] [ 2%] 2025-12-04T09:54:17.3063383Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.7976s] [ 2%] 2025-12-04T09:54:17.3063966Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.4842s] [ 2%] 2025-12-04T09:54:17.3064549Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.7178s] [ 2%] 2025-12-04T09:54:17.3065125Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [3.3941s] [ 2%] 2025-12-04T09:54:17.3065715Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.6080s] [ 2%] 2025-12-04T09:54:17.3066325Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works PASSED [2.9188s] [ 2%] 2025-12-04T09:54:17.3067263Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:49:26.184000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3068504Z E1204 09:49:26.184000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3069464Z E1204 09:49:26.184000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3070096Z E1204 09:49:26.225000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3071071Z E1204 09:49:26.225000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3072018Z E1204 09:49:26.225000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3072652Z E1204 09:49:28.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3073658Z E1204 09:49:28.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3074592Z E1204 09:49:28.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3075216Z E1204 09:49:28.016000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3076223Z E1204 09:49:28.016000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3077158Z E1204 09:49:28.016000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3077788Z E1204 09:49:28.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3078783Z E1204 09:49:28.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3079722Z E1204 09:49:28.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3080351Z E1204 09:49:28.064000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3081358Z E1204 09:49:28.064000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3082295Z E1204 09:49:28.064000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3082928Z E1204 09:49:28.087000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3083902Z E1204 09:49:28.087000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3084832Z E1204 09:49:28.087000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3085460Z E1204 09:49:28.089000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3086510Z E1204 09:49:28.089000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3087455Z E1204 09:49:28.089000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3088084Z E1204 09:49:28.091000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3089056Z E1204 09:49:28.091000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3089996Z E1204 09:49:28.091000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3090628Z E1204 09:49:28.103000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3091599Z E1204 09:49:28.103000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3092564Z E1204 09:49:28.103000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3093190Z E1204 09:49:28.105000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3094161Z E1204 09:49:28.105000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3095105Z E1204 09:49:28.105000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3095731Z E1204 09:49:28.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3096739Z E1204 09:49:28.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3097681Z E1204 09:49:28.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3098310Z E1204 09:49:28.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3099284Z E1204 09:49:28.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3100248Z E1204 09:49:28.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3100871Z E1204 09:49:30.162000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3101843Z E1204 09:49:30.162000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3102781Z E1204 09:49:30.162000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3103406Z E1204 09:49:30.202000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3104433Z E1204 09:49:30.202000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3105366Z E1204 09:49:30.202000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3105786Z PASSED [5.7775s] [ 2%] 2025-12-04T09:54:17.3106485Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:49:31.994000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3107711Z E1204 09:49:31.994000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3108646Z E1204 09:49:31.994000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3109279Z E1204 09:49:32.034000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3110249Z E1204 09:49:32.034000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3111224Z E1204 09:49:32.034000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3111847Z E1204 09:49:33.779000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3112815Z E1204 09:49:33.779000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3113750Z E1204 09:49:33.779000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3114379Z E1204 09:49:33.781000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3115347Z E1204 09:49:33.781000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3116336Z E1204 09:49:33.781000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3116963Z E1204 09:49:33.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3117946Z E1204 09:49:33.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3118910Z E1204 09:49:33.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3119543Z E1204 09:49:33.834000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3120517Z E1204 09:49:33.834000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3121456Z E1204 09:49:33.834000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3122082Z E1204 09:49:33.857000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3123104Z E1204 09:49:33.857000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3124038Z E1204 09:49:33.857000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3124670Z E1204 09:49:33.859000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3125642Z E1204 09:49:33.859000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3126618Z E1204 09:49:33.859000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3127257Z E1204 09:49:33.861000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3128239Z E1204 09:49:33.861000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3129216Z E1204 09:49:33.861000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3129844Z E1204 09:49:33.873000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3130809Z E1204 09:49:33.873000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3131751Z E1204 09:49:33.873000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3132385Z E1204 09:49:33.875000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3133357Z E1204 09:49:33.875000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3134291Z E1204 09:49:33.875000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3134924Z E1204 09:49:33.897000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3135890Z E1204 09:49:33.897000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3136922Z E1204 09:49:33.897000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3137549Z E1204 09:49:33.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3138525Z E1204 09:49:33.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3139467Z E1204 09:49:33.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3140099Z E1204 09:49:35.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3141131Z E1204 09:49:35.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3142065Z E1204 09:49:35.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3142692Z E1204 09:49:35.759000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3143659Z E1204 09:49:35.759000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3144590Z E1204 09:49:35.759000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3145009Z PASSED [5.5959s] [ 2%] 2025-12-04T09:54:17.3145683Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:49:38.063000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3146934Z E1204 09:49:38.063000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3148157Z E1204 09:49:38.063000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3148789Z E1204 09:49:38.101000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3149756Z E1204 09:49:38.101000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3150696Z E1204 09:49:38.101000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3151323Z E1204 09:49:39.772000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3152291Z E1204 09:49:39.772000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3153228Z E1204 09:49:39.772000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3153855Z E1204 09:49:39.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3154830Z E1204 09:49:39.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3155797Z E1204 09:49:39.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3156479Z E1204 09:49:39.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3157467Z E1204 09:49:39.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3158408Z E1204 09:49:39.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3159036Z E1204 09:49:39.821000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3160061Z E1204 09:49:39.821000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3160999Z E1204 09:49:39.821000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3161637Z E1204 09:49:39.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3162602Z E1204 09:49:39.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3163540Z E1204 09:49:39.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3164171Z E1204 09:49:39.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3165148Z E1204 09:49:39.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3166149Z E1204 09:49:39.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3166778Z E1204 09:49:39.852000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3167756Z E1204 09:49:39.852000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3168691Z E1204 09:49:39.852000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3169325Z E1204 09:49:39.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3170298Z E1204 09:49:39.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3171236Z E1204 09:49:39.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3171865Z E1204 09:49:39.866000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3172832Z E1204 09:49:39.866000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3173803Z E1204 09:49:39.866000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3174429Z E1204 09:49:39.889000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3175399Z E1204 09:49:39.889000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3176364Z E1204 09:49:39.889000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3176989Z E1204 09:49:39.891000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3177969Z E1204 09:49:39.891000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3178970Z E1204 09:49:39.891000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3179599Z E1204 09:49:41.634000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3180576Z E1204 09:49:41.634000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3181509Z E1204 09:49:41.634000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3182137Z E1204 09:49:41.672000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3183109Z E1204 09:49:41.672000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3184049Z E1204 09:49:41.672000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3184495Z PASSED [5.9037s] [ 2%] 2025-12-04T09:54:17.3185170Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:49:43.400000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3186429Z E1204 09:49:43.400000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3187368Z E1204 09:49:43.400000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3188008Z E1204 09:49:43.439000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3188991Z E1204 09:49:43.439000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3189925Z E1204 09:49:43.439000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3190554Z E1204 09:49:45.209000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3191523Z E1204 09:49:45.209000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3192503Z E1204 09:49:45.209000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3193138Z E1204 09:49:45.211000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3194112Z E1204 09:49:45.211000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3195043Z E1204 09:49:45.211000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3195671Z E1204 09:49:45.232000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3196717Z E1204 09:49:45.232000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3197655Z E1204 09:49:45.232000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3198289Z E1204 09:49:45.259000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3199266Z E1204 09:49:45.259000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3200206Z E1204 09:49:45.259000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3200837Z E1204 09:49:45.282000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3201823Z E1204 09:49:45.282000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3202757Z E1204 09:49:45.282000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3203418Z E1204 09:49:45.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3204386Z E1204 09:49:45.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3205320Z E1204 09:49:45.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3205974Z E1204 09:49:45.285000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3206951Z E1204 09:49:45.285000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3207895Z E1204 09:49:45.285000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3208525Z E1204 09:49:45.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3209495Z E1204 09:49:45.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3210430Z E1204 09:49:45.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3211090Z E1204 09:49:45.303000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3212063Z E1204 09:49:45.303000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3212996Z E1204 09:49:45.303000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3213621Z E1204 09:49:45.325000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3214590Z E1204 09:49:45.325000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3215586Z E1204 09:49:45.325000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3216256Z E1204 09:49:45.327000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3217235Z E1204 09:49:45.327000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3218171Z E1204 09:49:45.327000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3218796Z E1204 09:49:47.132000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3219769Z E1204 09:49:47.132000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3220702Z E1204 09:49:47.132000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3221328Z E1204 09:49:47.171000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3222327Z E1204 09:49:47.171000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3223260Z E1204 09:49:47.171000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3223675Z PASSED [5.5167s] [ 2%] 2025-12-04T09:54:17.3224345Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:49:48.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3225563Z E1204 09:49:48.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3226538Z E1204 09:49:48.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3227164Z E1204 09:49:48.938000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3228140Z E1204 09:49:48.938000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3228600Z E1204 09:49:48.938000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3228927Z E1204 09:49:50.624000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3229402Z E1204 09:49:50.624000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3229858Z E1204 09:49:50.624000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3230168Z E1204 09:49:50.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3230640Z E1204 09:49:50.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3231123Z E1204 09:49:50.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3231432Z E1204 09:49:50.647000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3231907Z E1204 09:49:50.647000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3232362Z E1204 09:49:50.647000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3232672Z E1204 09:49:50.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3233151Z E1204 09:49:50.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3233609Z E1204 09:49:50.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3233917Z E1204 09:49:50.698000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3234408Z E1204 09:49:50.698000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3234862Z E1204 09:49:50.698000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3235172Z E1204 09:49:50.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3235647Z E1204 09:49:50.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3236127Z E1204 09:49:50.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3236440Z E1204 09:49:50.702000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3236918Z E1204 09:49:50.702000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3237375Z E1204 09:49:50.702000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3237700Z E1204 09:49:50.714000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3238179Z E1204 09:49:50.714000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3238641Z E1204 09:49:50.714000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3238948Z E1204 09:49:50.716000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3239424Z E1204 09:49:50.716000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3239879Z E1204 09:49:50.716000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3240223Z E1204 09:49:50.738000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3240697Z E1204 09:49:50.738000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3241150Z E1204 09:49:50.738000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3241456Z E1204 09:49:50.740000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3241930Z E1204 09:49:50.740000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3242387Z E1204 09:49:50.740000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3242697Z E1204 09:49:52.646000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3243190Z E1204 09:49:52.646000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3243651Z E1204 09:49:52.646000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3243966Z E1204 09:49:52.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3244448Z E1204 09:49:52.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3244912Z E1204 09:49:52.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3245122Z PASSED [5.4791s] [ 2%] 2025-12-04T09:54:17.3245453Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:49:54.949000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3246069Z E1204 09:49:54.949000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3246524Z E1204 09:49:54.949000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3246857Z E1204 09:49:54.988000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3247337Z E1204 09:49:54.988000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3247803Z E1204 09:49:54.988000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3248118Z E1204 09:49:56.688000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3248594Z E1204 09:49:56.688000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3249052Z E1204 09:49:56.688000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3249391Z E1204 09:49:56.690000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3249866Z E1204 09:49:56.690000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3250326Z E1204 09:49:56.690000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3250637Z E1204 09:49:56.709000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3251112Z E1204 09:49:56.709000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3251573Z E1204 09:49:56.709000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3251887Z E1204 09:49:56.734000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3252384Z E1204 09:49:56.734000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3252841Z E1204 09:49:56.734000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3253151Z E1204 09:49:56.756000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3253627Z E1204 09:49:56.756000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3254086Z E1204 09:49:56.756000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3254398Z E1204 09:49:56.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3254875Z E1204 09:49:56.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3255335Z E1204 09:49:56.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3255645Z E1204 09:49:56.760000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3256175Z E1204 09:49:56.760000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3256636Z E1204 09:49:56.760000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3256949Z E1204 09:49:56.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3257433Z E1204 09:49:56.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3257890Z E1204 09:49:56.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3258247Z E1204 09:49:56.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3258726Z E1204 09:49:56.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3259190Z E1204 09:49:56.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3259499Z E1204 09:49:56.798000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3259975Z E1204 09:49:56.798000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3260438Z E1204 09:49:56.798000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3260752Z E1204 09:49:56.800000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3261228Z E1204 09:49:56.800000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3261715Z E1204 09:49:56.800000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3262029Z E1204 09:49:58.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3262509Z E1204 09:49:58.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3262968Z E1204 09:49:58.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3263280Z E1204 09:49:58.664000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3263762Z E1204 09:49:58.664000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3264219Z E1204 09:49:58.664000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3264431Z PASSED [5.9472s] [ 2%] 2025-12-04T09:54:17.3264762Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:00.343000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3265384Z E1204 09:50:00.343000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3265843Z E1204 09:50:00.343000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3266225Z E1204 09:50:00.383000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3266700Z E1204 09:50:00.383000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3267163Z E1204 09:50:00.383000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3267515Z E1204 09:50:02.091000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3267988Z E1204 09:50:02.091000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3268443Z E1204 09:50:02.091000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3268752Z E1204 09:50:02.093000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3269225Z E1204 09:50:02.093000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3269684Z E1204 09:50:02.093000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3269993Z E1204 09:50:02.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3270467Z E1204 09:50:02.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3270956Z E1204 09:50:02.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3271264Z E1204 09:50:02.140000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3271737Z E1204 09:50:02.140000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3272197Z E1204 09:50:02.140000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3272505Z E1204 09:50:02.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3272979Z E1204 09:50:02.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3273437Z E1204 09:50:02.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3273752Z E1204 09:50:02.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3274225Z E1204 09:50:02.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3274711Z E1204 09:50:02.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3275023Z E1204 09:50:02.167000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3275495Z E1204 09:50:02.167000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3275995Z E1204 09:50:02.167000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3276303Z E1204 09:50:02.180000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3276821Z E1204 09:50:02.180000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3277281Z E1204 09:50:02.180000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3277593Z E1204 09:50:02.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3278065Z E1204 09:50:02.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3278519Z E1204 09:50:02.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3279588Z E1204 09:50:02.204000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3280058Z E1204 09:50:02.204000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3280533Z E1204 09:50:02.204000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3280839Z E1204 09:50:02.205000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3281313Z E1204 09:50:02.205000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3281771Z E1204 09:50:02.205000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3282080Z E1204 09:50:04.025000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3282554Z E1204 09:50:04.025000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3283012Z E1204 09:50:04.025000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3283321Z E1204 09:50:04.065000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3283798Z E1204 09:50:04.065000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3284274Z E1204 09:50:04.065000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3284482Z PASSED [5.5203s] [ 2%] 2025-12-04T09:54:17.3284808Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:05.862000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3285405Z E1204 09:50:05.862000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3285863Z E1204 09:50:05.862000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3286472Z E1204 09:50:05.900000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3286981Z E1204 09:50:05.900000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3287437Z E1204 09:50:05.900000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3287745Z E1204 09:50:07.632000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3288220Z E1204 09:50:07.632000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3288676Z E1204 09:50:07.632000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3288990Z E1204 09:50:07.634000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3289461Z E1204 09:50:07.634000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3289935Z E1204 09:50:07.634000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3290247Z E1204 09:50:07.655000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3290718Z E1204 09:50:07.655000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3291175Z E1204 09:50:07.655000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3291487Z E1204 09:50:07.682000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3291958Z E1204 09:50:07.682000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3292415Z E1204 09:50:07.682000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3292723Z E1204 09:50:07.704000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3293196Z E1204 09:50:07.704000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3293664Z E1204 09:50:07.704000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3293967Z E1204 09:50:07.706000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3294440Z E1204 09:50:07.706000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3294889Z E1204 09:50:07.706000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3295193Z E1204 09:50:07.708000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3295686Z E1204 09:50:07.708000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3296170Z E1204 09:50:07.708000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3296474Z E1204 09:50:07.720000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3296938Z E1204 09:50:07.720000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3297385Z E1204 09:50:07.720000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3297688Z E1204 09:50:07.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3298161Z E1204 09:50:07.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3298627Z E1204 09:50:07.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3298929Z E1204 09:50:07.744000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3299395Z E1204 09:50:07.744000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3299846Z E1204 09:50:07.744000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3300151Z E1204 09:50:07.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3300619Z E1204 09:50:07.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3301068Z E1204 09:50:07.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3301373Z E1204 09:50:09.565000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3301839Z E1204 09:50:09.565000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3302303Z E1204 09:50:09.565000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3302606Z E1204 09:50:09.605000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3303074Z E1204 09:50:09.605000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3303522Z E1204 09:50:09.605000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3303722Z PASSED [5.5405s] [ 2%] 2025-12-04T09:54:17.3304044Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:11.393000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3304659Z E1204 09:50:11.393000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3305109Z E1204 09:50:11.393000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3305412Z E1204 09:50:11.433000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3305877Z E1204 09:50:11.433000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3306355Z E1204 09:50:11.433000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3306659Z E1204 09:50:13.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3307131Z E1204 09:50:13.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3307602Z E1204 09:50:13.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3307904Z E1204 09:50:13.219000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3308371Z E1204 09:50:13.219000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3308819Z E1204 09:50:13.219000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3309126Z E1204 09:50:13.241000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3309596Z E1204 09:50:13.241000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3310046Z E1204 09:50:13.241000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3310347Z E1204 09:50:13.268000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3310814Z E1204 09:50:13.268000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3311283Z E1204 09:50:13.268000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3311585Z E1204 09:50:13.292000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3312054Z E1204 09:50:13.292000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3312501Z E1204 09:50:13.292000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3312804Z E1204 09:50:13.294000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3313295Z E1204 09:50:13.294000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3313749Z E1204 09:50:13.294000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3314052Z E1204 09:50:13.296000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3314524Z E1204 09:50:13.296000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3314974Z E1204 09:50:13.296000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3315276Z E1204 09:50:13.308000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3315743Z E1204 09:50:13.308000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3316218Z E1204 09:50:13.308000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3316537Z E1204 09:50:13.310000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3317004Z E1204 09:50:13.310000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3317454Z E1204 09:50:13.310000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3317760Z E1204 09:50:13.332000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3318227Z E1204 09:50:13.332000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3318680Z E1204 09:50:13.332000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3318984Z E1204 09:50:13.334000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3319455Z E1204 09:50:13.334000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3319905Z E1204 09:50:13.334000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3320223Z E1204 09:50:15.878000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3320687Z E1204 09:50:15.878000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3321140Z E1204 09:50:15.878000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3321443Z E1204 09:50:15.918000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3321910Z E1204 09:50:15.918000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3322391Z E1204 09:50:15.918000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3322592Z PASSED [6.3014s] [ 2%] 2025-12-04T09:54:17.3322911Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:17.726000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3323496Z E1204 09:50:17.726000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3323944Z E1204 09:50:17.726000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3324249Z E1204 09:50:17.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3324720Z E1204 09:50:17.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3325171Z E1204 09:50:17.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3325490Z E1204 09:50:19.467000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3325991Z E1204 09:50:19.467000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3326439Z E1204 09:50:19.467000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3326744Z E1204 09:50:19.469000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3327209Z E1204 09:50:19.469000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3327660Z E1204 09:50:19.469000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3327962Z E1204 09:50:19.490000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3328433Z E1204 09:50:19.490000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3328903Z E1204 09:50:19.490000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3329212Z E1204 09:50:19.517000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3329679Z E1204 09:50:19.517000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3330129Z E1204 09:50:19.517000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3330431Z E1204 09:50:19.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3330897Z E1204 09:50:19.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3331375Z E1204 09:50:19.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3331679Z E1204 09:50:19.542000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3332147Z E1204 09:50:19.542000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3332598Z E1204 09:50:19.542000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3332900Z E1204 09:50:19.544000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3349994Z E1204 09:50:19.544000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3350594Z E1204 09:50:19.544000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3350970Z E1204 09:50:19.556000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3351465Z E1204 09:50:19.556000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3351948Z E1204 09:50:19.556000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3352249Z E1204 09:50:19.558000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3352722Z E1204 09:50:19.558000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3353174Z E1204 09:50:19.558000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3353476Z E1204 09:50:19.580000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3353941Z E1204 09:50:19.580000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3354387Z E1204 09:50:19.580000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3354703Z E1204 09:50:19.582000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3355170Z E1204 09:50:19.582000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3355625Z E1204 09:50:19.582000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3355964Z E1204 09:50:21.483000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3356431Z E1204 09:50:21.483000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3356884Z E1204 09:50:21.483000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3357227Z E1204 09:50:21.522000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3357695Z E1204 09:50:21.522000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3358142Z E1204 09:50:21.522000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3358343Z PASSED [5.5289s] [ 2%] 2025-12-04T09:54:17.3358675Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:23.229000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3359265Z E1204 09:50:23.229000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3359715Z E1204 09:50:23.229000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3360036Z E1204 09:50:23.272000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3360506Z E1204 09:50:23.272000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3360958Z E1204 09:50:23.272000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3361261Z E1204 09:50:25.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3361732Z E1204 09:50:25.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3362181Z E1204 09:50:25.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3362483Z E1204 09:50:25.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3362945Z E1204 09:50:25.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3363394Z E1204 09:50:25.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3363714Z E1204 09:50:25.032000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3364180Z E1204 09:50:25.032000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3364626Z E1204 09:50:25.032000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3364923Z E1204 09:50:25.060000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3365388Z E1204 09:50:25.060000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3365860Z E1204 09:50:25.060000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3366200Z E1204 09:50:25.083000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3366666Z E1204 09:50:25.083000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3367119Z E1204 09:50:25.083000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3367423Z E1204 09:50:25.085000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3367892Z E1204 09:50:25.085000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3368344Z E1204 09:50:25.085000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3368644Z E1204 09:50:25.087000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3369128Z E1204 09:50:25.087000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3369576Z E1204 09:50:25.087000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3369877Z E1204 09:50:25.100000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3370349Z E1204 09:50:25.100000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3370802Z E1204 09:50:25.100000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3371102Z E1204 09:50:25.102000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3371568Z E1204 09:50:25.102000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3372015Z E1204 09:50:25.102000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3372318Z E1204 09:50:25.124000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3372801Z E1204 09:50:25.124000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3373249Z E1204 09:50:25.124000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3373547Z E1204 09:50:25.126000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3374015Z E1204 09:50:25.126000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3374468Z E1204 09:50:25.126000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3374794Z E1204 09:50:27.006000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3375261Z E1204 09:50:27.006000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3375711Z E1204 09:50:27.006000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3376034Z E1204 09:50:27.046000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3376501Z E1204 09:50:27.046000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3376952Z E1204 09:50:27.046000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3377157Z PASSED [5.6235s] [ 2%] 2025-12-04T09:54:17.3377482Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:28.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3378090Z E1204 09:50:28.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3378538Z E1204 09:50:28.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3378839Z E1204 09:50:28.903000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3379306Z E1204 09:50:28.903000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3379753Z E1204 09:50:28.903000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3380053Z E1204 09:50:30.706000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3380518Z E1204 09:50:30.706000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3380962Z E1204 09:50:30.706000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3381260Z E1204 09:50:30.708000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3381738Z E1204 09:50:30.708000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3382184Z E1204 09:50:30.708000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3382481Z E1204 09:50:30.728000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3382943Z E1204 09:50:30.728000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3383387Z E1204 09:50:30.728000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3383716Z E1204 09:50:30.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3384180Z E1204 09:50:30.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3384627Z E1204 09:50:30.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3384928Z E1204 09:50:30.783000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3385396Z E1204 09:50:30.783000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3385845Z E1204 09:50:30.783000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3386172Z E1204 09:50:30.785000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3386651Z E1204 09:50:30.785000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3387094Z E1204 09:50:30.785000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3387391Z E1204 09:50:30.787000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3387857Z E1204 09:50:30.787000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3388303Z E1204 09:50:30.787000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3388604Z E1204 09:50:30.799000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3389072Z E1204 09:50:30.799000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3389516Z E1204 09:50:30.799000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3389814Z E1204 09:50:30.801000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3390291Z E1204 09:50:30.801000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3390733Z E1204 09:50:30.801000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3391033Z E1204 09:50:30.823000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3391496Z E1204 09:50:30.823000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3391942Z E1204 09:50:30.823000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3392241Z E1204 09:50:30.825000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3392740Z E1204 09:50:30.825000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3393188Z E1204 09:50:30.825000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3393486Z E1204 09:50:32.650000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3393952Z E1204 09:50:32.650000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3394397Z E1204 09:50:32.650000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3394697Z E1204 09:50:32.687000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3395163Z E1204 09:50:32.687000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3395623Z E1204 09:50:32.687000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3395821Z PASSED [5.5857s] [ 2%] 2025-12-04T09:54:17.3396170Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:35.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3396752Z E1204 09:50:35.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3397198Z E1204 09:50:35.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3397496Z E1204 09:50:35.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3397961Z E1204 09:50:35.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3398407Z E1204 09:50:35.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3398704Z E1204 09:50:36.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3399188Z E1204 09:50:36.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3399633Z E1204 09:50:36.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3399936Z E1204 09:50:36.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3400397Z E1204 09:50:36.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3400841Z E1204 09:50:36.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3401143Z E1204 09:50:36.869000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3401635Z E1204 09:50:36.869000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3402081Z E1204 09:50:36.869000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3402381Z E1204 09:50:36.894000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3402843Z E1204 09:50:36.894000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3403292Z E1204 09:50:36.894000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3403596Z E1204 09:50:36.916000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3404062Z E1204 09:50:36.916000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3404525Z E1204 09:50:36.916000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3404823Z E1204 09:50:36.917000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3405286Z E1204 09:50:36.917000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3405732Z E1204 09:50:36.917000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3406052Z E1204 09:50:36.919000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3406523Z E1204 09:50:36.919000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3406971Z E1204 09:50:36.919000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3407270Z E1204 09:50:36.931000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3407735Z E1204 09:50:36.931000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3408194Z E1204 09:50:36.931000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3408497Z E1204 09:50:36.933000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3408960Z E1204 09:50:36.933000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3409405Z E1204 09:50:36.933000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3409704Z E1204 09:50:36.955000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3410195Z E1204 09:50:36.955000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3410640Z E1204 09:50:36.955000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3410938Z E1204 09:50:36.957000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3411400Z E1204 09:50:36.957000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3411846Z E1204 09:50:36.957000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3412144Z E1204 09:50:38.761000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3412608Z E1204 09:50:38.761000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3413074Z E1204 09:50:38.761000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3413374Z E1204 09:50:38.800000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3413838Z E1204 09:50:38.800000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3414282Z E1204 09:50:38.800000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3414484Z PASSED [6.0121s] [ 2%] 2025-12-04T09:54:17.3414803Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:40.457000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3415384Z E1204 09:50:40.457000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3415834Z E1204 09:50:40.457000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3416162Z E1204 09:50:40.496000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3416628Z E1204 09:50:40.496000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3417090Z E1204 09:50:40.496000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3417389Z E1204 09:50:42.282000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3417854Z E1204 09:50:42.282000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3418298Z E1204 09:50:42.282000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3418600Z E1204 09:50:42.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3419098Z E1204 09:50:42.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3419551Z E1204 09:50:42.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3419854Z E1204 09:50:42.309000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3420319Z E1204 09:50:42.309000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3420770Z E1204 09:50:42.309000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3421078Z E1204 09:50:42.337000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3421544Z E1204 09:50:42.337000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3422017Z E1204 09:50:42.337000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3422319Z E1204 09:50:42.363000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3422789Z E1204 09:50:42.363000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3423242Z E1204 09:50:42.363000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3423547Z E1204 09:50:42.365000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3424015Z E1204 09:50:42.365000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3424470Z E1204 09:50:42.365000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3424774Z E1204 09:50:42.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3425242Z E1204 09:50:42.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3425711Z E1204 09:50:42.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3426047Z E1204 09:50:42.380000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3426516Z E1204 09:50:42.380000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3426967Z E1204 09:50:42.380000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3427269Z E1204 09:50:42.382000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3427771Z E1204 09:50:42.382000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3428222Z E1204 09:50:42.382000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3428527Z E1204 09:50:42.404000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3428993Z E1204 09:50:42.404000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3429441Z E1204 09:50:42.404000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3429741Z E1204 09:50:42.406000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3430212Z E1204 09:50:42.406000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3430676Z E1204 09:50:42.406000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3430979Z E1204 09:50:44.233000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3431448Z E1204 09:50:44.233000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3431898Z E1204 09:50:44.233000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3432201Z E1204 09:50:44.273000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3432671Z E1204 09:50:44.273000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3433118Z E1204 09:50:44.273000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3433319Z PASSED [5.6172s] [ 2%] 2025-12-04T09:54:17.3433639Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:46.073000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3434228Z E1204 09:50:46.073000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3434698Z E1204 09:50:46.073000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3435000Z E1204 09:50:46.111000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3435471Z E1204 09:50:46.111000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3435951Z E1204 09:50:46.111000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3436254Z E1204 09:50:47.833000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3436759Z E1204 09:50:47.833000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3437210Z E1204 09:50:47.833000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3437513Z E1204 09:50:47.835000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3437986Z E1204 09:50:47.835000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3438434Z E1204 09:50:47.835000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3438739Z E1204 09:50:47.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3439206Z E1204 09:50:47.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3439673Z E1204 09:50:47.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3439977Z E1204 09:50:47.883000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3440444Z E1204 09:50:47.883000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3440896Z E1204 09:50:47.883000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3441203Z E1204 09:50:47.906000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3441668Z E1204 09:50:47.906000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3442121Z E1204 09:50:47.906000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3442422Z E1204 09:50:47.908000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3442890Z E1204 09:50:47.908000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3443352Z E1204 09:50:47.908000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3443654Z E1204 09:50:47.910000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3444127Z E1204 09:50:47.910000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3444577Z E1204 09:50:47.910000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3444879Z E1204 09:50:47.922000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3445346Z E1204 09:50:47.922000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3445820Z E1204 09:50:47.922000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3446159Z E1204 09:50:47.924000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3446628Z E1204 09:50:47.924000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3447077Z E1204 09:50:47.924000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3447381Z E1204 09:50:47.946000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3447850Z E1204 09:50:47.946000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3448305Z E1204 09:50:47.946000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3448630Z E1204 09:50:47.947000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3449100Z E1204 09:50:47.947000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3449555Z E1204 09:50:47.947000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3449858Z E1204 09:50:49.766000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3450330Z E1204 09:50:49.766000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3450784Z E1204 09:50:49.766000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3451085Z E1204 09:50:49.804000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3451551Z E1204 09:50:49.804000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3451998Z E1204 09:50:49.804000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3452222Z PASSED [5.5145s] [ 2%] 2025-12-04T09:54:17.3452548Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:51.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3453136Z E1204 09:50:51.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3453587Z E1204 09:50:51.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3453894Z E1204 09:50:51.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3454388Z E1204 09:50:51.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3454836Z E1204 09:50:51.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3455139Z E1204 09:50:53.309000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3455606Z E1204 09:50:53.309000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3456081Z E1204 09:50:53.309000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3456382Z E1204 09:50:53.311000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3456854Z E1204 09:50:53.311000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3457312Z E1204 09:50:53.311000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3457648Z E1204 09:50:53.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3458117Z E1204 09:50:53.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3458564Z E1204 09:50:53.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3458871Z E1204 09:50:53.357000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3459342Z E1204 09:50:53.357000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3459791Z E1204 09:50:53.357000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3460091Z E1204 09:50:53.380000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3460560Z E1204 09:50:53.380000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3461016Z E1204 09:50:53.380000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3461343Z E1204 09:50:53.382000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3461813Z E1204 09:50:53.382000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3462265Z E1204 09:50:53.382000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3462571Z E1204 09:50:53.384000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3463041Z E1204 09:50:53.384000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3463525Z E1204 09:50:53.384000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3463832Z E1204 09:50:53.396000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3464301Z E1204 09:50:53.396000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3464757Z E1204 09:50:53.396000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3465059Z E1204 09:50:53.398000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3465529Z E1204 09:50:53.398000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3466008Z E1204 09:50:53.398000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3466310Z E1204 09:50:53.419000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3466790Z E1204 09:50:53.419000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3467238Z E1204 09:50:53.419000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3467541Z E1204 09:50:53.421000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3468015Z E1204 09:50:53.421000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3468465Z E1204 09:50:53.421000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3468768Z E1204 09:50:55.972000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3469232Z E1204 09:50:55.972000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3469681Z E1204 09:50:55.972000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3469982Z E1204 09:50:56.016000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3470467Z E1204 09:50:56.016000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3470917Z E1204 09:50:56.016000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3471119Z PASSED [6.2101s] [ 2%] 2025-12-04T09:54:17.3471444Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:50:57.831000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3472033Z E1204 09:50:57.831000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3472510Z E1204 09:50:57.831000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3472814Z E1204 09:50:57.870000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3473285Z E1204 09:50:57.870000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3473737Z E1204 09:50:57.870000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3474038Z E1204 09:50:59.553000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3474509Z E1204 09:50:59.553000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3474959Z E1204 09:50:59.553000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3475278Z E1204 09:50:59.555000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3475747Z E1204 09:50:59.555000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3476223Z E1204 09:50:59.555000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3476525Z E1204 09:50:59.576000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3476997Z E1204 09:50:59.576000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3477453Z E1204 09:50:59.576000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3477754Z E1204 09:50:59.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3478224Z E1204 09:50:59.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3478677Z E1204 09:50:59.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3479005Z E1204 09:50:59.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3479473Z E1204 09:50:59.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3479921Z E1204 09:50:59.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3480227Z E1204 09:50:59.628000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3480695Z E1204 09:50:59.628000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3481145Z E1204 09:50:59.628000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3481475Z E1204 09:50:59.630000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3481946Z E1204 09:50:59.630000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3482403Z E1204 09:50:59.630000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3482706Z E1204 09:50:59.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3483172Z E1204 09:50:59.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3483622Z E1204 09:50:59.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3483922Z E1204 09:50:59.644000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3484409Z E1204 09:50:59.644000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3484855Z E1204 09:50:59.644000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3485157Z E1204 09:50:59.666000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3485625Z E1204 09:50:59.666000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3486114Z E1204 09:50:59.666000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3486420Z E1204 09:50:59.668000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3486892Z E1204 09:50:59.668000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3487347Z E1204 09:50:59.668000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3487652Z E1204 09:51:01.515000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3488140Z E1204 09:51:01.515000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3488596Z E1204 09:51:01.515000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3488904Z E1204 09:51:01.554000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3489372Z E1204 09:51:01.554000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3489824Z E1204 09:51:01.554000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3490032Z PASSED [5.5051s] [ 2%] 2025-12-04T09:54:17.3491110Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:03.322000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3491695Z E1204 09:51:03.322000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3492145Z E1204 09:51:03.322000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3492451Z E1204 09:51:03.361000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3492920Z E1204 09:51:03.361000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3493372Z E1204 09:51:03.361000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3493675Z E1204 09:51:05.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3494159Z E1204 09:51:05.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3494608Z E1204 09:51:05.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3494912Z E1204 09:51:05.017000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3495383Z E1204 09:51:05.017000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3495836Z E1204 09:51:05.017000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3496217Z E1204 09:51:05.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3496689Z E1204 09:51:05.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3497141Z E1204 09:51:05.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3497446Z E1204 09:51:05.063000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3497934Z E1204 09:51:05.063000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3498387Z E1204 09:51:05.063000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3498692Z E1204 09:51:05.086000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3499160Z E1204 09:51:05.086000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3499613Z E1204 09:51:05.086000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3499943Z E1204 09:51:05.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3500410Z E1204 09:51:05.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3500862Z E1204 09:51:05.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3501165Z E1204 09:51:05.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3501634Z E1204 09:51:05.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3502087Z E1204 09:51:05.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3502395Z E1204 09:51:05.102000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3502862Z E1204 09:51:05.102000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3503328Z E1204 09:51:05.102000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3503632Z E1204 09:51:05.104000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3504098Z E1204 09:51:05.104000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3504548Z E1204 09:51:05.104000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3504857Z E1204 09:51:05.126000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3505153Z E1204 09:51:05.126000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3505281Z E1204 09:51:05.126000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3505421Z E1204 09:51:05.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3505715Z E1204 09:51:05.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3505849Z E1204 09:51:05.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3506026Z E1204 09:51:06.947000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3506317Z E1204 09:51:06.947000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3506445Z E1204 09:51:06.947000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3506587Z E1204 09:51:06.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3506905Z E1204 09:51:06.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3507033Z E1204 09:51:06.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3507074Z PASSED [5.4971s] [ 2%] 2025-12-04T09:54:17.3507341Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:08.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3507632Z E1204 09:51:08.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3507759Z E1204 09:51:08.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3507900Z E1204 09:51:08.834000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3508191Z E1204 09:51:08.834000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3508332Z E1204 09:51:08.834000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3508476Z E1204 09:51:10.599000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3508765Z E1204 09:51:10.599000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3508895Z E1204 09:51:10.599000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3509041Z E1204 09:51:10.601000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3509331Z E1204 09:51:10.601000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3509458Z E1204 09:51:10.601000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3509599Z E1204 09:51:10.622000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3509891Z E1204 09:51:10.622000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3510029Z E1204 09:51:10.622000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3510173Z E1204 09:51:10.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3510461Z E1204 09:51:10.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3510589Z E1204 09:51:10.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3510734Z E1204 09:51:10.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3511042Z E1204 09:51:10.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3511171Z E1204 09:51:10.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3511312Z E1204 09:51:10.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3511604Z E1204 09:51:10.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3511728Z E1204 09:51:10.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3511875Z E1204 09:51:10.677000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3512164Z E1204 09:51:10.677000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3512304Z E1204 09:51:10.677000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3512448Z E1204 09:51:10.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3512736Z E1204 09:51:10.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3512865Z E1204 09:51:10.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3513005Z E1204 09:51:10.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3513298Z E1204 09:51:10.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3513423Z E1204 09:51:10.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3513569Z E1204 09:51:10.713000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3513857Z E1204 09:51:10.713000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3513995Z E1204 09:51:10.713000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3514139Z E1204 09:51:10.715000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3514429Z E1204 09:51:10.715000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3514556Z E1204 09:51:10.715000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3514696Z E1204 09:51:12.597000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3515013Z E1204 09:51:12.597000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3515138Z E1204 09:51:12.597000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3515284Z E1204 09:51:12.637000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3515572Z E1204 09:51:12.637000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3515699Z E1204 09:51:12.637000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3515739Z PASSED [5.6512s] [ 2%] 2025-12-04T09:54:17.3516037Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:14.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3516331Z E1204 09:51:14.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3516471Z E1204 09:51:14.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3516615Z E1204 09:51:14.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3516903Z E1204 09:51:14.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3517032Z E1204 09:51:14.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3517172Z E1204 09:51:17.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3517466Z E1204 09:51:17.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3517590Z E1204 09:51:17.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3517732Z E1204 09:51:17.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3518026Z E1204 09:51:17.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3518163Z E1204 09:51:17.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3518308Z E1204 09:51:17.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3518596Z E1204 09:51:17.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3518722Z E1204 09:51:17.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3518863Z E1204 09:51:17.141000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3519177Z E1204 09:51:17.141000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3519300Z E1204 09:51:17.141000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3519444Z E1204 09:51:17.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3519739Z E1204 09:51:17.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3519863Z E1204 09:51:17.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3520006Z E1204 09:51:17.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3520296Z E1204 09:51:17.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3520432Z E1204 09:51:17.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3520574Z E1204 09:51:17.168000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3520866Z E1204 09:51:17.168000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3520989Z E1204 09:51:17.168000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3521134Z E1204 09:51:17.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3521427Z E1204 09:51:17.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3521552Z E1204 09:51:17.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3521695Z E1204 09:51:17.182000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3521989Z E1204 09:51:17.182000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3522126Z E1204 09:51:17.182000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3522266Z E1204 09:51:17.205000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3522560Z E1204 09:51:17.205000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3522685Z E1204 09:51:17.205000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3522828Z E1204 09:51:17.207000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3523139Z E1204 09:51:17.207000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3523266Z E1204 09:51:17.207000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3523412Z E1204 09:51:18.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3523702Z E1204 09:51:18.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3523829Z E1204 09:51:18.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3523970Z E1204 09:51:19.027000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3524266Z E1204 09:51:19.027000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3524391Z E1204 09:51:19.027000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3524447Z PASSED [6.4212s] [ 2%] 2025-12-04T09:54:17.3524707Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:21.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3524998Z E1204 09:51:21.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3525125Z E1204 09:51:21.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3525269Z E1204 09:51:21.048000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3525560Z E1204 09:51:21.048000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3525685Z E1204 09:51:21.048000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3525832Z E1204 09:51:22.936000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3526152Z E1204 09:51:22.936000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3526295Z E1204 09:51:22.936000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3526437Z E1204 09:51:22.938000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3526732Z E1204 09:51:22.938000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3526858Z E1204 09:51:22.938000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3527001Z E1204 09:51:22.959000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3527316Z E1204 09:51:22.959000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3527441Z E1204 09:51:22.959000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3527586Z E1204 09:51:22.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3527874Z E1204 09:51:22.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3528001Z E1204 09:51:22.986000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3528142Z E1204 09:51:23.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3528437Z E1204 09:51:23.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3528565Z E1204 09:51:23.009000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3528718Z E1204 09:51:23.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3529011Z E1204 09:51:23.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3529135Z E1204 09:51:23.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3529281Z E1204 09:51:23.013000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3529571Z E1204 09:51:23.013000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3529699Z E1204 09:51:23.013000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3529839Z E1204 09:51:23.025000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3530131Z E1204 09:51:23.025000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3530253Z E1204 09:51:23.025000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3530414Z E1204 09:51:23.027000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3530704Z E1204 09:51:23.027000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3530830Z E1204 09:51:23.027000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3530975Z E1204 09:51:23.049000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3531264Z E1204 09:51:23.049000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3531415Z E1204 09:51:23.049000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3531555Z E1204 09:51:23.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3531847Z E1204 09:51:23.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3531970Z E1204 09:51:23.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3532112Z E1204 09:51:25.029000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3532403Z E1204 09:51:25.029000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3532528Z E1204 09:51:25.029000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3532682Z E1204 09:51:25.068000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3532969Z E1204 09:51:25.068000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3533094Z E1204 09:51:25.068000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3533135Z PASSED [5.9687s] [ 2%] 2025-12-04T09:54:17.3533400Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:26.885000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3533689Z E1204 09:51:26.885000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3533815Z E1204 09:51:26.885000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3533957Z E1204 09:51:26.925000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3534247Z E1204 09:51:26.925000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3534389Z E1204 09:51:26.925000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3534532Z E1204 09:51:28.801000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3534823Z E1204 09:51:28.801000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3534947Z E1204 09:51:28.801000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3535091Z E1204 09:51:28.803000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3535380Z E1204 09:51:28.803000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3535526Z E1204 09:51:28.803000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3535670Z E1204 09:51:28.824000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3535988Z E1204 09:51:28.824000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3536118Z E1204 09:51:28.824000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3536259Z E1204 09:51:28.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3536552Z E1204 09:51:28.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3536676Z E1204 09:51:28.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3536838Z E1204 09:51:28.874000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3537126Z E1204 09:51:28.874000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3537255Z E1204 09:51:28.874000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3537396Z E1204 09:51:28.876000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3537692Z E1204 09:51:28.876000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3537820Z E1204 09:51:28.876000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3537961Z E1204 09:51:28.878000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3538253Z E1204 09:51:28.878000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3538376Z E1204 09:51:28.878000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3538537Z E1204 09:51:28.890000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3538827Z E1204 09:51:28.890000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3538954Z E1204 09:51:28.890000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3539094Z E1204 09:51:28.892000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3539384Z E1204 09:51:28.892000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3539512Z E1204 09:51:28.892000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3539685Z E1204 09:51:28.915000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3539978Z E1204 09:51:28.915000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3540103Z E1204 09:51:28.915000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3540248Z E1204 09:51:28.917000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3540537Z E1204 09:51:28.917000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3540668Z E1204 09:51:28.917000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3540810Z E1204 09:51:30.817000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3541117Z E1204 09:51:30.817000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3541242Z E1204 09:51:30.817000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3541382Z E1204 09:51:30.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3541676Z E1204 09:51:30.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3541802Z E1204 09:51:30.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3541848Z PASSED [5.7515s] [ 2%] 2025-12-04T09:54:17.3542108Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:32.719000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3542402Z E1204 09:51:32.719000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3542526Z E1204 09:51:32.719000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3542682Z E1204 09:51:32.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3542971Z E1204 09:51:32.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3543099Z E1204 09:51:32.758000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3543244Z E1204 09:51:34.559000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3543531Z E1204 09:51:34.559000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3543678Z E1204 09:51:34.559000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3543822Z E1204 09:51:34.561000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3544111Z E1204 09:51:34.561000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3544236Z E1204 09:51:34.561000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3544379Z E1204 09:51:34.582000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3544675Z E1204 09:51:34.582000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3544803Z E1204 09:51:34.582000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3544947Z E1204 09:51:34.609000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3545257Z E1204 09:51:34.609000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3545383Z E1204 09:51:34.609000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3545525Z E1204 09:51:34.631000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3545820Z E1204 09:51:34.631000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3545971Z E1204 09:51:34.631000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3546116Z E1204 09:51:34.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3546405Z E1204 09:51:34.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3546532Z E1204 09:51:34.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3546675Z E1204 09:51:34.635000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3546983Z E1204 09:51:34.635000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3547112Z E1204 09:51:34.635000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3547251Z E1204 09:51:34.647000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3547543Z E1204 09:51:34.647000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3547665Z E1204 09:51:34.647000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3547839Z E1204 09:51:34.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3548127Z E1204 09:51:34.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3548255Z E1204 09:51:34.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3548395Z E1204 09:51:34.671000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3548686Z E1204 09:51:34.671000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3548814Z E1204 09:51:34.671000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3548955Z E1204 09:51:34.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3549248Z E1204 09:51:34.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3549388Z E1204 09:51:34.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3549531Z E1204 09:51:37.230000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3549819Z E1204 09:51:37.230000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3549948Z E1204 09:51:37.230000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3550089Z E1204 09:51:37.274000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3550382Z E1204 09:51:37.274000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3550508Z E1204 09:51:37.274000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3550546Z PASSED [6.5155s] [ 2%] 2025-12-04T09:54:17.3550810Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:39.146000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3551114Z E1204 09:51:39.146000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3551240Z E1204 09:51:39.146000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3551385Z E1204 09:51:39.186000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3551679Z E1204 09:51:39.186000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3551802Z E1204 09:51:39.186000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3551970Z E1204 09:51:40.937000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3552261Z E1204 09:51:40.937000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3552386Z E1204 09:51:40.937000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3552529Z E1204 09:51:40.939000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3552816Z E1204 09:51:40.939000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3552944Z E1204 09:51:40.939000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3553085Z E1204 09:51:40.960000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3553388Z E1204 09:51:40.960000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3553512Z E1204 09:51:40.960000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3553657Z E1204 09:51:40.988000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3553950Z E1204 09:51:40.988000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3554074Z E1204 09:51:40.988000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3554217Z E1204 09:51:41.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3554506Z E1204 09:51:41.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3554632Z E1204 09:51:41.011000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3554773Z E1204 09:51:41.013000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3555075Z E1204 09:51:41.013000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3555198Z E1204 09:51:41.013000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3555343Z E1204 09:51:41.015000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3555633Z E1204 09:51:41.015000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3555761Z E1204 09:51:41.015000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3555906Z E1204 09:51:41.028000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3556252Z E1204 09:51:41.028000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3556380Z E1204 09:51:41.028000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3556521Z E1204 09:51:41.030000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3556812Z E1204 09:51:41.030000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3556935Z E1204 09:51:41.030000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3557082Z E1204 09:51:41.052000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3557372Z E1204 09:51:41.052000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3557510Z E1204 09:51:41.052000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3557654Z E1204 09:51:41.054000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3557943Z E1204 09:51:41.054000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3558072Z E1204 09:51:41.054000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3558214Z E1204 09:51:42.948000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3558508Z E1204 09:51:42.948000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3558631Z E1204 09:51:42.948000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3558775Z E1204 09:51:42.987000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3559064Z E1204 09:51:42.987000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3559204Z E1204 09:51:42.987000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3559246Z PASSED [5.6139s] [ 2%] 2025-12-04T09:54:17.3559507Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:44.755000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3559797Z E1204 09:51:44.755000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3559918Z E1204 09:51:44.755000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3560087Z E1204 09:51:44.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3560375Z E1204 09:51:44.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3560502Z E1204 09:51:44.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3560641Z E1204 09:51:46.610000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3560929Z E1204 09:51:46.610000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3561053Z E1204 09:51:46.610000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3561194Z E1204 09:51:46.612000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3561481Z E1204 09:51:46.612000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3561616Z E1204 09:51:46.612000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3561756Z E1204 09:51:46.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3562041Z E1204 09:51:46.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3562169Z E1204 09:51:46.633000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3562309Z E1204 09:51:46.659000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3562604Z E1204 09:51:46.659000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3562727Z E1204 09:51:46.659000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3562868Z E1204 09:51:46.682000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3563157Z E1204 09:51:46.682000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3563290Z E1204 09:51:46.682000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3563435Z E1204 09:51:46.684000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3563721Z E1204 09:51:46.684000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3563845Z E1204 09:51:46.684000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3563985Z E1204 09:51:46.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3564297Z E1204 09:51:46.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3564419Z E1204 09:51:46.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3564562Z E1204 09:51:46.698000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3564855Z E1204 09:51:46.698000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3564979Z E1204 09:51:46.698000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3565123Z E1204 09:51:46.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3565410Z E1204 09:51:46.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3565547Z E1204 09:51:46.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3565688Z E1204 09:51:46.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3566005Z E1204 09:51:46.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3566128Z E1204 09:51:46.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3566273Z E1204 09:51:46.724000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3566563Z E1204 09:51:46.724000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3566687Z E1204 09:51:46.724000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3566827Z E1204 09:51:48.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3567118Z E1204 09:51:48.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3567260Z E1204 09:51:48.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3567400Z E1204 09:51:48.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3567689Z E1204 09:51:48.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3567811Z E1204 09:51:48.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3567852Z PASSED [5.6519s] [ 2%] 2025-12-04T09:54:17.3568110Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:50.410000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3568423Z E1204 09:51:50.410000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3568549Z E1204 09:51:50.410000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3568688Z E1204 09:51:50.449000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3568978Z E1204 09:51:50.449000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3569100Z E1204 09:51:50.449000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3569245Z E1204 09:51:52.299000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3569534Z E1204 09:51:52.299000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3569674Z E1204 09:51:52.299000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3569814Z E1204 09:51:52.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3570103Z E1204 09:51:52.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3570230Z E1204 09:51:52.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3570371Z E1204 09:51:52.321000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3570660Z E1204 09:51:52.321000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3570783Z E1204 09:51:52.321000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3570927Z E1204 09:51:52.348000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3571214Z E1204 09:51:52.348000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3571352Z E1204 09:51:52.348000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3571491Z E1204 09:51:52.371000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3571783Z E1204 09:51:52.371000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3571909Z E1204 09:51:52.371000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3572048Z E1204 09:51:52.373000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3572357Z E1204 09:51:52.373000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3572484Z E1204 09:51:52.373000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3572626Z E1204 09:51:52.375000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3572916Z E1204 09:51:52.375000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3573040Z E1204 09:51:52.375000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3573179Z E1204 09:51:52.387000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3573470Z E1204 09:51:52.387000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3573605Z E1204 09:51:52.387000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3573748Z E1204 09:51:52.389000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3574039Z E1204 09:51:52.389000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3574161Z E1204 09:51:52.389000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3574307Z E1204 09:51:52.411000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3574593Z E1204 09:51:52.411000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3574720Z E1204 09:51:52.411000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3574859Z E1204 09:51:52.413000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3575148Z E1204 09:51:52.413000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3575282Z E1204 09:51:52.413000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3575426Z E1204 09:51:54.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3575717Z E1204 09:51:54.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3575840Z E1204 09:51:54.284000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3576011Z E1204 09:51:54.323000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3576304Z E1204 09:51:54.323000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3576464Z E1204 09:51:54.323000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3576504Z PASSED [5.6511s] [ 2%] 2025-12-04T09:54:17.3576766Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:51:56.073000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3577054Z E1204 09:51:56.073000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3577178Z E1204 09:51:56.073000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3577319Z E1204 09:51:56.112000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3577608Z E1204 09:51:56.112000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3577754Z E1204 09:51:56.112000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3577893Z E1204 09:51:58.669000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3578182Z E1204 09:51:58.669000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3578305Z E1204 09:51:58.669000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3578451Z E1204 09:51:58.671000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3578737Z E1204 09:51:58.671000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3578862Z E1204 09:51:58.671000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3579001Z E1204 09:51:58.693000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3579289Z E1204 09:51:58.693000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3579426Z E1204 09:51:58.693000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3579565Z E1204 09:51:58.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3579855Z E1204 09:51:58.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3579981Z E1204 09:51:58.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3580123Z E1204 09:51:58.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3580410Z E1204 09:51:58.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3580559Z E1204 09:51:58.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3580700Z E1204 09:51:58.748000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3580992Z E1204 09:51:58.748000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3581114Z E1204 09:51:58.748000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3581254Z E1204 09:51:58.750000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3581544Z E1204 09:51:58.750000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3581668Z E1204 09:51:58.750000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3581821Z E1204 09:51:58.763000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3582107Z E1204 09:51:58.763000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3582232Z E1204 09:51:58.763000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3582371Z E1204 09:51:58.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3582661Z E1204 09:51:58.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3582789Z E1204 09:51:58.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3582932Z E1204 09:51:58.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3583224Z E1204 09:51:58.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3583347Z E1204 09:51:58.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3583503Z E1204 09:51:58.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3583791Z E1204 09:51:58.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3583916Z E1204 09:51:58.794000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3584055Z E1204 09:52:00.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3584346Z E1204 09:52:00.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3584500Z E1204 09:52:00.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3584640Z E1204 09:52:00.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3584930Z E1204 09:52:00.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3585054Z E1204 09:52:00.642000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3585095Z PASSED [6.4045s] [ 2%] 2025-12-04T09:54:17.3585355Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:02.440000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3585646Z E1204 09:52:02.440000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3585768Z E1204 09:52:02.440000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3585949Z E1204 09:52:02.475000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3586239Z E1204 09:52:02.475000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3586363Z E1204 09:52:02.475000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3586505Z E1204 09:52:04.178000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3586795Z E1204 09:52:04.178000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3586922Z E1204 09:52:04.178000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3587061Z E1204 09:52:04.180000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3587357Z E1204 09:52:04.180000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3587480Z E1204 09:52:04.180000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3587637Z E1204 09:52:04.251000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3587925Z E1204 09:52:04.251000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3588050Z E1204 09:52:04.251000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3588192Z E1204 09:52:04.277000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3588479Z E1204 09:52:04.277000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3588633Z E1204 09:52:04.277000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3588774Z E1204 09:52:04.300000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3589065Z E1204 09:52:04.300000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3589188Z E1204 09:52:04.300000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3589331Z E1204 09:52:04.303000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3589625Z E1204 09:52:04.303000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3589752Z E1204 09:52:04.303000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3589894Z E1204 09:52:04.306000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3590197Z E1204 09:52:04.306000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3590321Z E1204 09:52:04.306000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3590462Z E1204 09:52:04.318000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3590754Z E1204 09:52:04.318000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3590876Z E1204 09:52:04.318000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3591020Z E1204 09:52:04.320000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3591306Z E1204 09:52:04.320000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3591433Z E1204 09:52:04.320000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3591575Z E1204 09:52:04.342000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3591885Z E1204 09:52:04.342000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3592012Z E1204 09:52:04.342000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3592153Z E1204 09:52:04.344000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3592445Z E1204 09:52:04.344000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3592568Z E1204 09:52:04.344000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3592733Z E1204 09:52:06.197000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3593020Z E1204 09:52:06.197000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3593146Z E1204 09:52:06.197000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3593285Z E1204 09:52:06.233000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3593575Z E1204 09:52:06.233000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3593702Z E1204 09:52:06.233000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3593741Z PASSED [5.4838s] [ 2%] 2025-12-04T09:54:17.3594005Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:07.965000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3594307Z E1204 09:52:07.965000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3594434Z E1204 09:52:07.965000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3594575Z E1204 09:52:08.005000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3594867Z E1204 09:52:08.005000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3594989Z E1204 09:52:08.005000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3595132Z E1204 09:52:09.718000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3595421Z E1204 09:52:09.718000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3595543Z E1204 09:52:09.718000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3595699Z E1204 09:52:09.720000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3596011Z E1204 09:52:09.720000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3596137Z E1204 09:52:09.720000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3596277Z E1204 09:52:09.740000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3596571Z E1204 09:52:09.740000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3596693Z E1204 09:52:09.740000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3596864Z E1204 09:52:09.766000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3597152Z E1204 09:52:09.766000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3597276Z E1204 09:52:09.766000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3597417Z E1204 09:52:09.788000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3597705Z E1204 09:52:09.788000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3597832Z E1204 09:52:09.788000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3597971Z E1204 09:52:09.790000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3598275Z E1204 09:52:09.790000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3598398Z E1204 09:52:09.790000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3598545Z E1204 09:52:09.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3598843Z E1204 09:52:09.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3598967Z E1204 09:52:09.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3599111Z E1204 09:52:09.804000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3599402Z E1204 09:52:09.804000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3599528Z E1204 09:52:09.804000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3599667Z E1204 09:52:09.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3599969Z E1204 09:52:09.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3600091Z E1204 09:52:09.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3600234Z E1204 09:52:09.828000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3600521Z E1204 09:52:09.828000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3600647Z E1204 09:52:09.828000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3600812Z E1204 09:52:09.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3601106Z E1204 09:52:09.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3601232Z E1204 09:52:09.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3601372Z E1204 09:52:11.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3601662Z E1204 09:52:11.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3601783Z E1204 09:52:11.700000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3601928Z E1204 09:52:11.742000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3602218Z E1204 09:52:11.742000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3602355Z E1204 09:52:11.742000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3602397Z PASSED [5.5490s] [ 2%] 2025-12-04T09:54:17.3602656Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:13.502000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3602951Z E1204 09:52:13.502000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3603074Z E1204 09:52:13.502000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3603218Z E1204 09:52:13.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3603506Z E1204 09:52:13.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3603631Z E1204 09:52:13.540000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3603770Z E1204 09:52:15.277000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3604072Z E1204 09:52:15.277000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3604197Z E1204 09:52:15.277000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3604336Z E1204 09:52:15.280000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3604625Z E1204 09:52:15.280000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3604748Z E1204 09:52:15.280000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3604910Z E1204 09:52:15.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3605196Z E1204 09:52:15.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3605325Z E1204 09:52:15.301000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3605465Z E1204 09:52:15.328000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3605754Z E1204 09:52:15.328000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3605879Z E1204 09:52:15.328000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3606053Z E1204 09:52:15.351000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3606341Z E1204 09:52:15.351000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3606488Z E1204 09:52:15.351000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3606629Z E1204 09:52:15.353000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3606918Z E1204 09:52:15.353000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3607045Z E1204 09:52:15.353000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3607186Z E1204 09:52:15.355000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3607478Z E1204 09:52:15.355000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3607601Z E1204 09:52:15.355000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3607743Z E1204 09:52:15.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3608035Z E1204 09:52:15.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3608171Z E1204 09:52:15.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3608316Z E1204 09:52:15.369000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3608603Z E1204 09:52:15.369000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3608728Z E1204 09:52:15.369000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3608867Z E1204 09:52:15.391000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3609182Z E1204 09:52:15.391000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3609307Z E1204 09:52:15.391000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3609449Z E1204 09:52:15.393000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3609739Z E1204 09:52:15.393000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3609862Z E1204 09:52:15.393000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3610008Z E1204 09:52:17.253000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3610295Z E1204 09:52:17.253000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3610439Z E1204 09:52:17.253000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3610578Z E1204 09:52:17.290000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3610868Z E1204 09:52:17.290000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3610991Z E1204 09:52:17.290000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3611033Z PASSED [5.5030s] [ 2%] 2025-12-04T09:54:17.3611292Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:19.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3611584Z E1204 09:52:19.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3611709Z E1204 09:52:19.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3611850Z E1204 09:52:19.817000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3612144Z E1204 09:52:19.817000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3612278Z E1204 09:52:19.817000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3612421Z E1204 09:52:21.577000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3612707Z E1204 09:52:21.577000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3612831Z E1204 09:52:21.577000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3612970Z E1204 09:52:21.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3613282Z E1204 09:52:21.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3613409Z E1204 09:52:21.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3620072Z E1204 09:52:21.599000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3620397Z E1204 09:52:21.599000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3620523Z E1204 09:52:21.599000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3620678Z E1204 09:52:21.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3620972Z E1204 09:52:21.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3621137Z E1204 09:52:21.626000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3621276Z E1204 09:52:21.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3621567Z E1204 09:52:21.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3621693Z E1204 09:52:21.649000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3621832Z E1204 09:52:21.651000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3622122Z E1204 09:52:21.651000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3622246Z E1204 09:52:21.651000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3622387Z E1204 09:52:21.653000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3622677Z E1204 09:52:21.653000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3622820Z E1204 09:52:21.653000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3622960Z E1204 09:52:21.665000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3623251Z E1204 09:52:21.665000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3623374Z E1204 09:52:21.665000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3623513Z E1204 09:52:21.667000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3623828Z E1204 09:52:21.667000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3623950Z E1204 09:52:21.667000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3624093Z E1204 09:52:21.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3624380Z E1204 09:52:21.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3624504Z E1204 09:52:21.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3624644Z E1204 09:52:21.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3624942Z E1204 09:52:21.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3625079Z E1204 09:52:21.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3625221Z E1204 09:52:23.531000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3625514Z E1204 09:52:23.531000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3625637Z E1204 09:52:23.531000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3625780Z E1204 09:52:23.569000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3626098Z E1204 09:52:23.569000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3626225Z E1204 09:52:23.569000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3626269Z PASSED [6.3970s] [ 2%] 2025-12-04T09:54:17.3626533Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:25.433000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3626822Z E1204 09:52:25.433000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3626960Z E1204 09:52:25.433000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3627100Z E1204 09:52:25.474000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3627388Z E1204 09:52:25.474000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3627513Z E1204 09:52:25.474000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3627654Z E1204 09:52:27.215000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3627977Z E1204 09:52:27.215000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3628100Z E1204 09:52:27.215000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3628242Z E1204 09:52:27.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3628529Z E1204 09:52:27.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3628654Z E1204 09:52:27.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3628795Z E1204 09:52:27.238000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3629087Z E1204 09:52:27.238000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3629228Z E1204 09:52:27.238000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3629367Z E1204 09:52:27.266000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3629655Z E1204 09:52:27.266000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3629777Z E1204 09:52:27.266000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3629922Z E1204 09:52:27.289000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3630212Z E1204 09:52:27.289000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3630339Z E1204 09:52:27.289000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3630477Z E1204 09:52:27.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3630767Z E1204 09:52:27.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3630906Z E1204 09:52:27.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3631046Z E1204 09:52:27.293000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3631337Z E1204 09:52:27.293000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3631460Z E1204 09:52:27.293000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3631603Z E1204 09:52:27.305000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3631914Z E1204 09:52:27.305000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3632040Z E1204 09:52:27.305000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3632183Z E1204 09:52:27.307000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3632477Z E1204 09:52:27.307000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3632601Z E1204 09:52:27.307000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3632740Z E1204 09:52:27.330000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3633031Z E1204 09:52:27.330000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3633154Z E1204 09:52:27.330000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3633307Z E1204 09:52:27.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3633596Z E1204 09:52:27.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3633721Z E1204 09:52:27.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3633861Z E1204 09:52:30.410000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3634153Z E1204 09:52:30.410000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3634279Z E1204 09:52:30.410000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3634420Z E1204 09:52:30.451000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3634709Z E1204 09:52:30.451000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3634833Z E1204 09:52:30.451000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3634888Z PASSED [6.8441s] [ 2%] 2025-12-04T09:54:17.3635150Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:32.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3635441Z E1204 09:52:32.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3635564Z E1204 09:52:32.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3635709Z E1204 09:52:32.338000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3636136Z E1204 09:52:32.338000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3636262Z E1204 09:52:32.338000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3636404Z E1204 09:52:34.215000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3636694Z E1204 09:52:34.215000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3636819Z E1204 09:52:34.215000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3636958Z E1204 09:52:34.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3637249Z E1204 09:52:34.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3637372Z E1204 09:52:34.217000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3637526Z E1204 09:52:34.239000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3637812Z E1204 09:52:34.239000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3637936Z E1204 09:52:34.239000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3638080Z E1204 09:52:34.266000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3638365Z E1204 09:52:34.266000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3638488Z E1204 09:52:34.266000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3638629Z E1204 09:52:34.289000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3638920Z E1204 09:52:34.289000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3639043Z E1204 09:52:34.289000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3639202Z E1204 09:52:34.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3639491Z E1204 09:52:34.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3639617Z E1204 09:52:34.291000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3639760Z E1204 09:52:34.293000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3640047Z E1204 09:52:34.293000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3640198Z E1204 09:52:34.293000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3640337Z E1204 09:52:34.305000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3640629Z E1204 09:52:34.305000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3640749Z E1204 09:52:34.305000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3640890Z E1204 09:52:34.307000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3641180Z E1204 09:52:34.307000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3641305Z E1204 09:52:34.307000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3641457Z E1204 09:52:34.329000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3641743Z E1204 09:52:34.329000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3641866Z E1204 09:52:34.329000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3642004Z E1204 09:52:34.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3642294Z E1204 09:52:34.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3642416Z E1204 09:52:34.331000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3642558Z E1204 09:52:36.157000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3642844Z E1204 09:52:36.157000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3642969Z E1204 09:52:36.157000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3643122Z E1204 09:52:36.197000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3643414Z E1204 09:52:36.197000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3643537Z E1204 09:52:36.197000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3643577Z PASSED [5.7822s] [ 2%] 2025-12-04T09:54:17.3643838Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:38.043000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3644124Z E1204 09:52:38.043000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3644276Z E1204 09:52:38.043000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3644417Z E1204 09:52:38.082000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3644707Z E1204 09:52:38.082000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3644829Z E1204 09:52:38.082000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3644970Z E1204 09:52:39.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3645262Z E1204 09:52:39.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3645385Z E1204 09:52:39.806000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3645538Z E1204 09:52:39.808000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3645826Z E1204 09:52:39.808000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3645973Z E1204 09:52:39.808000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3646113Z E1204 09:52:39.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3646402Z E1204 09:52:39.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3646524Z E1204 09:52:39.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3646663Z E1204 09:52:39.858000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3646950Z E1204 09:52:39.858000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3647072Z E1204 09:52:39.858000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3647233Z E1204 09:52:39.882000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3647519Z E1204 09:52:39.882000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3647642Z E1204 09:52:39.882000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3647782Z E1204 09:52:39.884000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3648074Z E1204 09:52:39.884000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3648197Z E1204 09:52:39.884000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3648361Z E1204 09:52:39.886000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3648650Z E1204 09:52:39.886000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3648777Z E1204 09:52:39.886000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3648917Z E1204 09:52:39.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3649203Z E1204 09:52:39.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3649328Z E1204 09:52:39.899000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3649467Z E1204 09:52:39.901000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3649768Z E1204 09:52:39.901000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3649890Z E1204 09:52:39.901000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3650030Z E1204 09:52:39.923000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3650323Z E1204 09:52:39.923000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3650447Z E1204 09:52:39.923000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3650588Z E1204 09:52:39.925000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3650875Z E1204 09:52:39.925000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3650998Z E1204 09:52:39.925000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3651139Z E1204 09:52:42.515000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3651439Z E1204 09:52:42.515000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3651562Z E1204 09:52:42.515000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3651701Z E1204 09:52:42.561000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3651988Z E1204 09:52:42.561000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3652110Z E1204 09:52:42.561000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3652150Z PASSED [6.3203s] [ 2%] 2025-12-04T09:54:17.3652434Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:44.357000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3652725Z E1204 09:52:44.357000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3652847Z E1204 09:52:44.357000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3652987Z E1204 09:52:44.395000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3653272Z E1204 09:52:44.395000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3653398Z E1204 09:52:44.395000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3653538Z E1204 09:52:46.059000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3653838Z E1204 09:52:46.059000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3653961Z E1204 09:52:46.059000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3654100Z E1204 09:52:46.061000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3654388Z E1204 09:52:46.061000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3654509Z E1204 09:52:46.061000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3654651Z E1204 09:52:46.079000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3654938Z E1204 09:52:46.079000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3655060Z E1204 09:52:46.079000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3655198Z E1204 09:52:46.103000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3655499Z E1204 09:52:46.103000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3655622Z E1204 09:52:46.103000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3655761Z E1204 09:52:46.125000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3656079Z E1204 09:52:46.125000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3656200Z E1204 09:52:46.125000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3656368Z E1204 09:52:46.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3656655Z E1204 09:52:46.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3656780Z E1204 09:52:46.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3656919Z E1204 09:52:46.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3657207Z E1204 09:52:46.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3657332Z E1204 09:52:46.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3657472Z E1204 09:52:46.141000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3657760Z E1204 09:52:46.141000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3657898Z E1204 09:52:46.141000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3658038Z E1204 09:52:46.142000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3658326Z E1204 09:52:46.142000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3658452Z E1204 09:52:46.142000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3658591Z E1204 09:52:46.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3658879Z E1204 09:52:46.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3659002Z E1204 09:52:46.164000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3659145Z E1204 09:52:46.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3659455Z E1204 09:52:46.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3659577Z E1204 09:52:46.166000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3659718Z E1204 09:52:48.012000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3660003Z E1204 09:52:48.012000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3660126Z E1204 09:52:48.012000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3660264Z E1204 09:52:48.050000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3660575Z E1204 09:52:48.050000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3660698Z E1204 09:52:48.050000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3660736Z PASSED [5.3456s] [ 2%] 2025-12-04T09:54:17.3660995Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:49.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3661286Z E1204 09:52:49.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3661414Z E1204 09:52:49.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3661553Z E1204 09:52:49.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3661855Z E1204 09:52:49.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3661977Z E1204 09:52:49.723000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3662118Z E1204 09:52:51.506000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3662406Z E1204 09:52:51.506000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3662529Z E1204 09:52:51.506000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3662668Z E1204 09:52:51.508000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3662956Z E1204 09:52:51.508000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3663079Z E1204 09:52:51.508000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3663217Z E1204 09:52:51.528000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3663524Z E1204 09:52:51.528000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3663648Z E1204 09:52:51.528000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3663792Z E1204 09:52:51.554000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3664077Z E1204 09:52:51.554000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3664200Z E1204 09:52:51.554000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3664338Z E1204 09:52:51.577000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3664645Z E1204 09:52:51.577000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3664769Z E1204 09:52:51.577000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3664908Z E1204 09:52:51.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3665196Z E1204 09:52:51.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3665317Z E1204 09:52:51.579000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3665460Z E1204 09:52:51.581000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3665754Z E1204 09:52:51.581000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3665892Z E1204 09:52:51.581000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3666065Z E1204 09:52:51.593000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3666352Z E1204 09:52:51.593000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3666477Z E1204 09:52:51.593000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3666616Z E1204 09:52:51.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3666906Z E1204 09:52:51.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3667027Z E1204 09:52:51.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3667166Z E1204 09:52:51.617000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3667453Z E1204 09:52:51.617000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3667590Z E1204 09:52:51.617000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3667728Z E1204 09:52:51.618000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3668021Z E1204 09:52:51.618000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3668144Z E1204 09:52:51.618000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3668283Z E1204 09:52:53.455000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3668594Z E1204 09:52:53.455000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3668716Z E1204 09:52:53.455000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3668857Z E1204 09:52:53.495000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3669145Z E1204 09:52:53.495000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3669267Z E1204 09:52:53.495000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3669305Z PASSED [5.6258s] [ 2%] 2025-12-04T09:54:17.3669565Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:52:55.330000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3669853Z E1204 09:52:55.330000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3669992Z E1204 09:52:55.330000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3670131Z E1204 09:52:55.368000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3670420Z E1204 09:52:55.368000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3670545Z E1204 09:52:55.368000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3670684Z E1204 09:52:57.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3670973Z E1204 09:52:57.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3671094Z E1204 09:52:57.090000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3671233Z E1204 09:52:57.092000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3671522Z E1204 09:52:57.092000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3671656Z E1204 09:52:57.092000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3671796Z E1204 09:52:57.114000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3672082Z E1204 09:52:57.114000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3672204Z E1204 09:52:57.114000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3672344Z E1204 09:52:57.142000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3672654Z E1204 09:52:57.142000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3672775Z E1204 09:52:57.142000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3672917Z E1204 09:52:57.165000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3673202Z E1204 09:52:57.165000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3673325Z E1204 09:52:57.165000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3673466Z E1204 09:52:57.167000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3673753Z E1204 09:52:57.167000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3673899Z E1204 09:52:57.167000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3674038Z E1204 09:52:57.169000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3674326Z E1204 09:52:57.169000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3674448Z E1204 09:52:57.169000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3674592Z E1204 09:52:57.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3674883Z E1204 09:52:57.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3675007Z E1204 09:52:57.181000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3675145Z E1204 09:52:57.183000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3675433Z E1204 09:52:57.183000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3675575Z E1204 09:52:57.183000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3675714Z E1204 09:52:57.206000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3676029Z E1204 09:52:57.206000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3676151Z E1204 09:52:57.206000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3676291Z E1204 09:52:57.207000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3676605Z E1204 09:52:57.207000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3676729Z E1204 09:52:57.207000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3676869Z E1204 09:52:59.105000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3677159Z E1204 09:52:59.105000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3677281Z E1204 09:52:59.105000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3677420Z E1204 09:52:59.146000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3677709Z E1204 09:52:59.146000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3677830Z E1204 09:52:59.146000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3677886Z PASSED [5.5783s] [ 2%] 2025-12-04T09:54:17.3678143Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:00.941000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3678430Z E1204 09:53:00.941000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3678552Z E1204 09:53:00.941000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3678693Z E1204 09:53:00.982000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3678982Z E1204 09:53:00.982000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3679104Z E1204 09:53:00.982000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3679245Z E1204 09:53:02.773000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3679531Z E1204 09:53:02.773000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3679668Z E1204 09:53:02.773000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3679808Z E1204 09:53:02.775000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3680098Z E1204 09:53:02.775000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3680220Z E1204 09:53:02.775000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3680363Z E1204 09:53:02.797000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3680668Z E1204 09:53:02.797000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3680791Z E1204 09:53:02.797000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3680932Z E1204 09:53:02.824000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3681218Z E1204 09:53:02.824000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3681343Z E1204 09:53:02.824000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3681485Z E1204 09:53:02.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3681771Z E1204 09:53:02.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3681909Z E1204 09:53:02.848000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3682049Z E1204 09:53:02.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3682334Z E1204 09:53:02.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3682456Z E1204 09:53:02.850000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3682597Z E1204 09:53:02.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3682889Z E1204 09:53:02.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3683010Z E1204 09:53:02.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3683149Z E1204 09:53:02.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3683436Z E1204 09:53:02.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3683569Z E1204 09:53:02.864000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3683711Z E1204 09:53:02.866000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3683997Z E1204 09:53:02.866000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3684120Z E1204 09:53:02.866000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3684259Z E1204 09:53:02.888000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3684546Z E1204 09:53:02.888000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3684688Z E1204 09:53:02.888000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3684829Z E1204 09:53:02.890000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3685117Z E1204 09:53:02.890000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3685239Z E1204 09:53:02.890000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3685379Z E1204 09:53:05.501000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3685666Z E1204 09:53:05.501000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3685790Z E1204 09:53:05.501000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3685976Z E1204 09:53:05.542000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3686265Z E1204 09:53:05.542000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3686386Z E1204 09:53:05.542000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3686425Z PASSED [6.3163s] [ 2%] 2025-12-04T09:54:17.3686685Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:07.246000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3686975Z E1204 09:53:07.246000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3687097Z E1204 09:53:07.246000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3687238Z E1204 09:53:07.286000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3687525Z E1204 09:53:07.286000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3687662Z E1204 09:53:07.286000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3687802Z E1204 09:53:09.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3688091Z E1204 09:53:09.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3688215Z E1204 09:53:09.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3688353Z E1204 09:53:09.024000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3688640Z E1204 09:53:09.024000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3688795Z E1204 09:53:09.024000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3688936Z E1204 09:53:09.045000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3689224Z E1204 09:53:09.045000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3689345Z E1204 09:53:09.045000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3689484Z E1204 09:53:09.072000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3689773Z E1204 09:53:09.072000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3689897Z E1204 09:53:09.072000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3690052Z E1204 09:53:09.094000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3690345Z E1204 09:53:09.094000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3690466Z E1204 09:53:09.094000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3690604Z E1204 09:53:09.096000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3690894Z E1204 09:53:09.096000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3691016Z E1204 09:53:09.096000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3691155Z E1204 09:53:09.098000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3691442Z E1204 09:53:09.098000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3691565Z E1204 09:53:09.098000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3691718Z E1204 09:53:09.110000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3692006Z E1204 09:53:09.110000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3692130Z E1204 09:53:09.110000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3692269Z E1204 09:53:09.112000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3692558Z E1204 09:53:09.112000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3692702Z E1204 09:53:09.112000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3692843Z E1204 09:53:09.133000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3693130Z E1204 09:53:09.133000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3693254Z E1204 09:53:09.133000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3693393Z E1204 09:53:09.135000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3693685Z E1204 09:53:09.135000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3693808Z E1204 09:53:09.135000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3693948Z E1204 09:53:10.926000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3694247Z E1204 09:53:10.926000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3694371Z E1204 09:53:10.926000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3694511Z E1204 09:53:10.965000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3694805Z E1204 09:53:10.965000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3694928Z E1204 09:53:10.965000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3694967Z PASSED [5.5316s] [ 2%] 2025-12-04T09:54:17.3695225Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:12.859000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3695511Z E1204 09:53:12.859000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3695634Z E1204 09:53:12.859000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3695785Z E1204 09:53:12.898000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3696102Z E1204 09:53:12.898000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3696226Z E1204 09:53:12.898000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3696365Z E1204 09:53:14.660000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3696652Z E1204 09:53:14.660000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3696805Z E1204 09:53:14.660000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3696949Z E1204 09:53:14.662000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3697238Z E1204 09:53:14.662000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3697361Z E1204 09:53:14.662000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3697501Z E1204 09:53:14.683000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3697790Z E1204 09:53:14.683000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3697913Z E1204 09:53:14.683000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3698052Z E1204 09:53:14.711000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3698354Z E1204 09:53:14.711000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3698476Z E1204 09:53:14.711000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3698617Z E1204 09:53:14.734000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3698905Z E1204 09:53:14.734000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3699029Z E1204 09:53:14.734000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3699170Z E1204 09:53:14.736000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3699459Z E1204 09:53:14.736000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3699580Z E1204 09:53:14.736000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3699719Z E1204 09:53:14.738000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3700023Z E1204 09:53:14.738000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3700146Z E1204 09:53:14.738000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3700287Z E1204 09:53:14.750000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3700573Z E1204 09:53:14.750000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3700696Z E1204 09:53:14.750000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3700853Z E1204 09:53:14.752000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3701141Z E1204 09:53:14.752000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3701264Z E1204 09:53:14.752000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3701405Z E1204 09:53:14.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3701694Z E1204 09:53:14.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3701817Z E1204 09:53:14.774000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3701957Z E1204 09:53:14.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3702259Z E1204 09:53:14.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3702383Z E1204 09:53:14.776000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3702521Z E1204 09:53:16.733000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3702809Z E1204 09:53:16.733000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3702932Z E1204 09:53:16.733000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3703071Z E1204 09:53:16.773000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3703360Z E1204 09:53:16.773000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3703481Z E1204 09:53:16.773000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3703520Z PASSED [5.7224s] [ 2%] 2025-12-04T09:54:17.3703784Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:18.509000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3704082Z E1204 09:53:18.509000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3704204Z E1204 09:53:18.509000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3704344Z E1204 09:53:18.548000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3704631Z E1204 09:53:18.548000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3704753Z E1204 09:53:18.548000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3704919Z E1204 09:53:20.743000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3705210Z E1204 09:53:20.743000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3705334Z E1204 09:53:20.743000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3705472Z E1204 09:53:20.745000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3705760Z E1204 09:53:20.745000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3705885Z E1204 09:53:20.745000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3706054Z E1204 09:53:20.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3706365Z E1204 09:53:20.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3706489Z E1204 09:53:20.765000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3706627Z E1204 09:53:20.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3706918Z E1204 09:53:20.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3707041Z E1204 09:53:20.792000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3707181Z E1204 09:53:20.814000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3707468Z E1204 09:53:20.814000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3707589Z E1204 09:53:20.814000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3707730Z E1204 09:53:20.816000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3708032Z E1204 09:53:20.816000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3708155Z E1204 09:53:20.816000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3708295Z E1204 09:53:20.818000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3708585Z E1204 09:53:20.818000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3708708Z E1204 09:53:20.818000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3708874Z E1204 09:53:20.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3709162Z E1204 09:53:20.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3709285Z E1204 09:53:20.830000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3709424Z E1204 09:53:20.832000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3709710Z E1204 09:53:20.832000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3709832Z E1204 09:53:20.832000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3709973Z E1204 09:53:20.854000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3710265Z E1204 09:53:20.854000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3710406Z E1204 09:53:20.854000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3710544Z E1204 09:53:20.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3710833Z E1204 09:53:20.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3710956Z E1204 09:53:20.856000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3711095Z E1204 09:53:22.724000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3711383Z E1204 09:53:22.724000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3711505Z E1204 09:53:22.724000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3711643Z E1204 09:53:22.763000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3711932Z E1204 09:53:22.763000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3712066Z E1204 09:53:22.763000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3712105Z PASSED [6.0853s] [ 2%] 2025-12-04T09:54:17.3712365Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:24.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3712657Z E1204 09:53:24.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3712779Z E1204 09:53:24.595000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3712939Z E1204 09:53:24.630000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3713226Z E1204 09:53:24.630000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3713349Z E1204 09:53:24.630000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3713489Z E1204 09:53:27.067000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3713776Z E1204 09:53:27.067000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3713900Z E1204 09:53:27.067000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3714040Z E1204 09:53:27.069000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3714325Z E1204 09:53:27.069000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3714462Z E1204 09:53:27.069000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3714601Z E1204 09:53:27.092000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3714894Z E1204 09:53:27.092000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3715017Z E1204 09:53:27.092000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3715157Z E1204 09:53:27.121000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3715444Z E1204 09:53:27.121000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3715568Z E1204 09:53:27.121000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3715708Z E1204 09:53:27.145000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3716025Z E1204 09:53:27.145000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3716165Z E1204 09:53:27.145000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3716305Z E1204 09:53:27.147000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3716591Z E1204 09:53:27.147000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3716712Z E1204 09:53:27.147000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3716851Z E1204 09:53:27.149000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3717174Z E1204 09:53:27.149000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3717299Z E1204 09:53:27.149000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3717440Z E1204 09:53:27.161000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3717729Z E1204 09:53:27.161000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3717852Z E1204 09:53:27.161000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3717993Z E1204 09:53:27.163000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3718280Z E1204 09:53:27.163000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3718418Z E1204 09:53:27.163000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3718558Z E1204 09:53:27.186000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3718844Z E1204 09:53:27.186000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3718970Z E1204 09:53:27.186000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3719109Z E1204 09:53:27.188000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3719404Z E1204 09:53:27.188000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3719529Z E1204 09:53:27.188000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3719667Z E1204 09:53:29.189000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3719954Z E1204 09:53:29.189000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3720093Z E1204 09:53:29.189000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3720234Z E1204 09:53:29.227000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3720521Z E1204 09:53:29.227000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3720644Z E1204 09:53:29.227000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3720682Z PASSED [6.4870s] [ 2%] 2025-12-04T09:54:17.3720940Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:31.131000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3721251Z E1204 09:53:31.131000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3721374Z E1204 09:53:31.131000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3721515Z E1204 09:53:31.168000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3721805Z E1204 09:53:31.168000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3721928Z E1204 09:53:31.168000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3722069Z E1204 09:53:32.968000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3722357Z E1204 09:53:32.968000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3722491Z E1204 09:53:32.968000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3722632Z E1204 09:53:32.970000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3722919Z E1204 09:53:32.970000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3723043Z E1204 09:53:32.970000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3723185Z E1204 09:53:32.989000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3723473Z E1204 09:53:32.989000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3723596Z E1204 09:53:32.989000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3723737Z E1204 09:53:33.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3724027Z E1204 09:53:33.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3724159Z E1204 09:53:33.014000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3724300Z E1204 09:53:33.035000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3724586Z E1204 09:53:33.035000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3724708Z E1204 09:53:33.035000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3724848Z E1204 09:53:33.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3725153Z E1204 09:53:33.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3725276Z E1204 09:53:33.037000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3725416Z E1204 09:53:33.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3725706Z E1204 09:53:33.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3725827Z E1204 09:53:33.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3725996Z E1204 09:53:33.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3726285Z E1204 09:53:33.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3726427Z E1204 09:53:33.051000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3726567Z E1204 09:53:33.053000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3726853Z E1204 09:53:33.053000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3726975Z E1204 09:53:33.053000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3727117Z E1204 09:53:33.074000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3727404Z E1204 09:53:33.074000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3727526Z E1204 09:53:33.074000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3727666Z E1204 09:53:33.076000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3727953Z E1204 09:53:33.076000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3728092Z E1204 09:53:33.076000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3728234Z E1204 09:53:34.969000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3728520Z E1204 09:53:34.969000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3728644Z E1204 09:53:34.969000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3728782Z E1204 09:53:35.004000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3729093Z E1204 09:53:35.004000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3729216Z E1204 09:53:35.004000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3729255Z PASSED [5.7660s] [ 2%] 2025-12-04T09:54:17.3729514Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:36.926000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3729802Z E1204 09:53:36.926000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3729925Z E1204 09:53:36.926000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3730064Z E1204 09:53:36.961000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3730356Z E1204 09:53:36.961000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3730495Z E1204 09:53:36.961000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3730635Z E1204 09:53:38.701000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3730920Z E1204 09:53:38.701000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3731042Z E1204 09:53:38.701000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3731184Z E1204 09:53:38.703000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3731471Z E1204 09:53:38.703000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3731593Z E1204 09:53:38.703000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3731733Z E1204 09:53:38.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3732020Z E1204 09:53:38.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3732155Z E1204 09:53:38.722000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3732295Z E1204 09:53:38.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3732586Z E1204 09:53:38.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3732710Z E1204 09:53:38.746000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3732849Z E1204 09:53:38.768000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3733155Z E1204 09:53:38.768000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3733277Z E1204 09:53:38.768000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3733419Z E1204 09:53:38.770000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3733708Z E1204 09:53:38.770000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3733829Z E1204 09:53:38.770000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3733970Z E1204 09:53:38.772000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3734259Z E1204 09:53:38.772000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3734382Z E1204 09:53:38.772000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3734540Z E1204 09:53:38.783000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3734833Z E1204 09:53:38.783000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3734958Z E1204 09:53:38.783000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3735100Z E1204 09:53:38.785000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3735390Z E1204 09:53:38.785000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3735514Z E1204 09:53:38.785000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3735654Z E1204 09:53:38.807000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3735975Z E1204 09:53:38.807000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3736098Z E1204 09:53:38.807000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3736253Z E1204 09:53:38.809000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3736543Z E1204 09:53:38.809000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3736668Z E1204 09:53:38.809000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3736807Z E1204 09:53:40.651000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3737098Z E1204 09:53:40.651000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3737246Z E1204 09:53:40.651000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3737386Z E1204 09:53:40.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3737673Z E1204 09:53:40.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3737796Z E1204 09:53:40.686000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3737834Z PASSED [5.6329s] [ 2%] 2025-12-04T09:54:17.3738092Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:42.486000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3738381Z E1204 09:53:42.486000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3738503Z E1204 09:53:42.486000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3738657Z E1204 09:53:42.524000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3738947Z E1204 09:53:42.524000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3739070Z E1204 09:53:42.524000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3739210Z E1204 09:53:44.294000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3739497Z E1204 09:53:44.294000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3739622Z E1204 09:53:44.294000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3739762Z E1204 09:53:44.296000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3740047Z E1204 09:53:44.296000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3740171Z E1204 09:53:44.296000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3740324Z E1204 09:53:44.317000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3740612Z E1204 09:53:44.317000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3740738Z E1204 09:53:44.317000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3740876Z E1204 09:53:44.344000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3741167Z E1204 09:53:44.344000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3741308Z E1204 09:53:44.344000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3741448Z E1204 09:53:44.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3741735Z E1204 09:53:44.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3741858Z E1204 09:53:44.367000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3741996Z E1204 09:53:44.369000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3742284Z E1204 09:53:44.369000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3742410Z E1204 09:53:44.369000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3742562Z E1204 09:53:44.371000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3742853Z E1204 09:53:44.371000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3742975Z E1204 09:53:44.371000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3743117Z E1204 09:53:44.383000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3743411Z E1204 09:53:44.383000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3743538Z E1204 09:53:44.383000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3743679Z E1204 09:53:44.385000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3743971Z E1204 09:53:44.385000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3744097Z E1204 09:53:44.385000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3744249Z E1204 09:53:44.407000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3744541Z E1204 09:53:44.407000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3744665Z E1204 09:53:44.407000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3744807Z E1204 09:53:44.409000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3745095Z E1204 09:53:44.409000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3745224Z E1204 09:53:44.409000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3745387Z E1204 09:53:46.350000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3745683Z E1204 09:53:46.350000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3745807Z E1204 09:53:46.350000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3745982Z E1204 09:53:46.389000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3746273Z E1204 09:53:46.389000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3746398Z E1204 09:53:46.389000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3746439Z PASSED [5.6350s] [ 2%] 2025-12-04T09:54:17.3746698Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:48.980000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3747007Z E1204 09:53:48.980000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3747130Z E1204 09:53:48.980000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3747275Z E1204 09:53:49.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3747565Z E1204 09:53:49.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3747690Z E1204 09:53:49.022000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3747834Z E1204 09:53:50.743000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3748123Z E1204 09:53:50.743000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3748248Z E1204 09:53:50.743000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3748403Z E1204 09:53:50.745000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3748693Z E1204 09:53:50.745000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3748817Z E1204 09:53:50.745000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3748959Z E1204 09:53:50.764000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3749246Z E1204 09:53:50.764000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3749407Z E1204 09:53:50.764000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3749548Z E1204 09:53:50.788000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3749837Z E1204 09:53:50.788000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3749962Z E1204 09:53:50.788000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3750102Z E1204 09:53:50.810000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3750394Z E1204 09:53:50.810000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3750519Z E1204 09:53:50.810000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3750661Z E1204 09:53:50.812000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3750960Z E1204 09:53:50.812000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3751086Z E1204 09:53:50.812000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3751226Z E1204 09:53:50.814000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3751519Z E1204 09:53:50.814000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3751647Z E1204 09:53:50.814000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3751788Z E1204 09:53:50.826000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3752079Z E1204 09:53:50.826000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3752202Z E1204 09:53:50.826000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3752349Z E1204 09:53:50.828000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3752654Z E1204 09:53:50.828000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3752780Z E1204 09:53:50.828000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3752919Z E1204 09:53:50.849000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3753209Z E1204 09:53:50.849000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3753333Z E1204 09:53:50.849000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3753497Z E1204 09:53:50.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3753797Z E1204 09:53:50.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3753921Z E1204 09:53:50.851000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3754062Z E1204 09:53:52.717000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3754349Z E1204 09:53:52.717000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3754476Z E1204 09:53:52.717000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3754618Z E1204 09:53:52.756000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3754908Z E1204 09:53:52.756000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3755047Z E1204 09:53:52.756000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3755086Z PASSED [6.4071s] [ 2%] 2025-12-04T09:54:17.3755347Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:53:54.523000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3755638Z E1204 09:53:54.523000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3755764Z E1204 09:53:54.523000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3755907Z E1204 09:53:54.562000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3756235Z E1204 09:53:54.562000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3756358Z E1204 09:53:54.562000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3756501Z E1204 09:53:56.368000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3756810Z E1204 09:53:56.368000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3756935Z E1204 09:53:56.368000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3757079Z E1204 09:53:56.370000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3757365Z E1204 09:53:56.370000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3757491Z E1204 09:53:56.370000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3757656Z E1204 09:53:56.391000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3757946Z E1204 09:53:56.391000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3758070Z E1204 09:53:56.391000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3758213Z E1204 09:53:56.418000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3758499Z E1204 09:53:56.418000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3758627Z E1204 09:53:56.418000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3758768Z E1204 09:53:56.442000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3759056Z E1204 09:53:56.442000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3759197Z E1204 09:53:56.442000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3759338Z E1204 09:53:56.443000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3759630Z E1204 09:53:56.443000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3759754Z E1204 09:53:56.443000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3759896Z E1204 09:53:56.445000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3760186Z E1204 09:53:56.445000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3760309Z E1204 09:53:56.445000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3760451Z E1204 09:53:56.458000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3760750Z E1204 09:53:56.458000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3760876Z E1204 09:53:56.458000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3761017Z E1204 09:53:56.460000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3761309Z E1204 09:53:56.460000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3761431Z E1204 09:53:56.460000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3761574Z E1204 09:53:56.485000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3762710Z E1204 09:53:56.485000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3762840Z E1204 09:53:56.485000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3762982Z E1204 09:53:56.487000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3763271Z E1204 09:53:56.487000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3763397Z E1204 09:53:56.487000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3763538Z E1204 09:53:58.403000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3763825Z E1204 09:53:58.403000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3763964Z E1204 09:53:58.403000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3764106Z E1204 09:53:58.443000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3764397Z E1204 09:53:58.443000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3764526Z E1204 09:53:58.443000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3764566Z PASSED [5.7290s] [ 2%] 2025-12-04T09:54:17.3764826Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:54:00.260000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3765119Z E1204 09:54:00.260000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3765241Z E1204 09:54:00.260000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3765383Z E1204 09:54:00.300000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3765682Z E1204 09:54:00.300000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3765806Z E1204 09:54:00.300000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3765980Z E1204 09:54:02.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3766273Z E1204 09:54:02.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3766396Z E1204 09:54:02.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3766539Z E1204 09:54:02.041000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3766870Z E1204 09:54:02.041000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3766994Z E1204 09:54:02.041000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3767137Z E1204 09:54:02.061000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3767424Z E1204 09:54:02.061000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3767550Z E1204 09:54:02.061000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3767692Z E1204 09:54:02.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3767982Z E1204 09:54:02.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3768119Z E1204 09:54:02.088000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3768262Z E1204 09:54:02.111000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3768552Z E1204 09:54:02.111000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3768678Z E1204 09:54:02.111000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3768821Z E1204 09:54:02.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3769111Z E1204 09:54:02.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3769236Z E1204 09:54:02.113000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3769376Z E1204 09:54:02.115000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3769669Z E1204 09:54:02.115000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3769807Z E1204 09:54:02.115000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3769949Z E1204 09:54:02.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3770243Z E1204 09:54:02.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3770366Z E1204 09:54:02.127000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3770508Z E1204 09:54:02.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3770815Z E1204 09:54:02.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3770941Z E1204 09:54:02.129000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3771080Z E1204 09:54:02.151000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3771374Z E1204 09:54:02.151000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3771498Z E1204 09:54:02.151000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3771639Z E1204 09:54:02.153000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3771930Z E1204 09:54:02.153000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3772064Z E1204 09:54:02.153000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3772209Z E1204 09:54:04.000000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3772497Z E1204 09:54:04.000000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3772622Z E1204 09:54:04.000000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3772764Z E1204 09:54:04.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3773052Z E1204 09:54:04.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3773176Z E1204 09:54:04.039000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3773216Z PASSED [5.5071s] [ 2%] 2025-12-04T09:54:17.3773480Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_exhaustive_dtypes E1204 09:54:05.802000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3773776Z E1204 09:54:05.802000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3773911Z E1204 09:54:05.802000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3774054Z E1204 09:54:05.842000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3774343Z E1204 09:54:05.842000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T09:54:17.3774465Z E1204 09:54:05.842000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3774606Z E1204 09:54:07.601000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3774914Z E1204 09:54:07.601000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3775038Z E1204 09:54:07.601000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3775179Z E1204 09:54:07.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3775468Z E1204 09:54:07.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3775594Z E1204 09:54:07.603000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3775739Z E1204 09:54:07.623000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3776074Z E1204 09:54:07.623000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3776212Z E1204 09:54:07.623000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3776354Z E1204 09:54:07.650000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3776641Z E1204 09:54:07.650000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3776766Z E1204 09:54:07.650000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3776908Z E1204 09:54:07.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3777198Z E1204 09:54:07.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3777324Z E1204 09:54:07.673000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3777466Z E1204 09:54:07.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3777756Z E1204 09:54:07.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3777895Z E1204 09:54:07.675000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3778037Z E1204 09:54:07.677000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3778328Z E1204 09:54:07.677000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3778453Z E1204 09:54:07.677000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3778594Z E1204 09:54:07.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3778910Z E1204 09:54:07.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3779034Z E1204 09:54:07.689000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3779177Z E1204 09:54:07.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3779469Z E1204 09:54:07.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 81920 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3779592Z E1204 09:54:07.691000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3779734Z E1204 09:54:07.712000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3780024Z E1204 09:54:07.712000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 98304 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3780149Z E1204 09:54:07.712000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3780304Z E1204 09:54:07.714000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3780596Z E1204 09:54:07.714000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 131072 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3780719Z E1204 09:54:07.714000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3780863Z E1204 09:54:09.591000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3781153Z E1204 09:54:09.591000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3781276Z E1204 09:54:09.591000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T09:54:17.3781418Z E1204 09:54:09.629000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T09:54:17.3781705Z E1204 09:54:09.629000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T09:54:17.3781828Z E1204 09:54:09.629000 83973 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T09:54:17.3781886Z PASSED [5.6322s] [ 2%] 2025-12-04T09:54:17.3782179Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3782459Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3782744Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3783041Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3783316Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3783593Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3783869Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3784144Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests 
verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3784420Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3784708Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3784983Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3785256Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3785538Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3785812Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3786119Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3786394Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests 
verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3786682Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3786957Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3787230Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3787504Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3787803Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3788080Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3788356Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3788629Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests 
verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3788909Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3789183Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3789480Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3789759Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3790034Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3790308Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3790583Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3790859Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests 
verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3791144Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3791421Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3791698Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3791974Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3792276Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3792550Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3792827Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3793103Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0002s] (Test is enabled but --rerun-disabled-tests 
verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3793378Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3793654Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3793939Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3794217Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3794498Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3794774Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3795052Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3795326Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests 
verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3795603Z inductor/test_pattern_matcher.py::TestPatternMatcher::test_remove_noop_pass_with_remove_passes SKIPPED [0.0003s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T09:54:17.3795617Z 2025-12-04T09:54:17.3795677Z =================================== FAILURES =================================== 2025-12-04T09:54:17.3795762Z _______________________ TestPatternMatcher.test_mixed_mm _______________________ 2025-12-04T09:54:17.3795813Z Traceback (most recent call last): 2025-12-04T09:54:17.3795983Z File "/var/lib/jenkins/pytorch/test/inductor/test_pattern_matcher.py", line 369, in test_mixed_mm 2025-12-04T09:54:17.3796040Z self._test_mixed_impl(fn, args, True, False) 2025-12-04T09:54:17.3796173Z File "/var/lib/jenkins/pytorch/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:54:17.3796255Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:54:17.3796329Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:54:17.3796368Z Searched string: 2025-12-04T09:54:17.3796431Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:54:17.3796462Z 2025-12-04T09:54:17.3796518Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:54:17.3796520Z 2025-12-04T09:54:17.3796579Z a_mask = offs_k[None, :] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.3796639Z b_mask = offs_k[:, None] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.3796641Z 2025-12-04T09:54:17.3796700Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.3796759Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.3796761Z 2025-12-04T09:54:17.3796805Z idx_m = offs_a_m[:, None] 2025-12-04T09:54:17.3796849Z idx_n = a_k_idx_vals 2025-12-04T09:54:17.3796894Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.3796954Z a = tl.load(A + (xindex), mask=a_mask, 
other=0.0) 2025-12-04T09:54:17.3796956Z 2025-12-04T09:54:17.3796999Z idx_m = b_k_idx_vals 2025-12-04T09:54:17.3797045Z idx_n = offs_b_n[None, :] 2025-12-04T09:54:17.3797086Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.3797146Z b = tl.load(B + (xindex), mask=b_mask, other=0.0) 2025-12-04T09:54:17.3797148Z 2025-12-04T09:54:17.3797149Z 2025-12-04T09:54:17.3797220Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:54:17.3797243Z 2025-12-04T09:54:17.3797244Z 2025-12-04T09:54:17.3797301Z # rematerialize rm and rn to save registers 2025-12-04T09:54:17.3797356Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:54:17.3797409Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:54:17.3797449Z idx_m = rm[:, None] 2025-12-04T09:54:17.3797492Z idx_n = rn[None, :] 2025-12-04T09:54:17.3797535Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:54:17.3797537Z 2025-12-04T09:54:17.3797583Z # inductor generates a suffix 2025-12-04T09:54:17.3797624Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.3797717Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:54:17.3797758Z ''', device_str='cuda') 2025-12-04T09:54:17.3797761Z 2025-12-04T09:54:17.3797762Z 2025-12-04T09:54:17.3797811Z async_compile.wait(globals()) 2025-12-04T09:54:17.3797850Z del async_compile 2025-12-04T09:54:17.3797852Z 2025-12-04T09:54:17.3797893Z class Runner: 2025-12-04T09:54:17.3797940Z def __init__(self, partitions): 2025-12-04T09:54:17.3797990Z self.partitions = partitions 2025-12-04T09:54:17.3797992Z 2025-12-04T09:54:17.3798041Z def recursively_apply_fns(self, fns): 2025-12-04T09:54:17.3798085Z new_callables = [] 2025-12-04T09:54:17.3798137Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:54:17.3798189Z new_callables.append(fn(c)) 2025-12-04T09:54:17.3798235Z self.partitions = new_callables 2025-12-04T09:54:17.3798237Z 2025-12-04T09:54:17.3798280Z def call(self, args): 2025-12-04T09:54:17.3798321Z arg0_1, arg1_1 = args 
2025-12-04T09:54:17.3798373Z args.clear() 2025-12-04T09:54:17.3798425Z assert_size_stride(arg0_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.3798478Z assert_size_stride(arg1_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.3798524Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:54:17.3798571Z torch.cuda.set_device(0) 2025-12-04T09:54:17.3798641Z buf0 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.3798735Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:54:17.3798779Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.3798858Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 64, stream=stream0) 2025-12-04T09:54:17.3798896Z del arg0_1 2025-12-04T09:54:17.3798962Z buf1 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.3799066Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:54:17.3799112Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.3799221Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 1, 1, 1, stream=stream0) 2025-12-04T09:54:17.3799262Z del arg1_1 2025-12-04T09:54:17.3799300Z del buf0 2025-12-04T09:54:17.3799340Z return (buf1, ) 2025-12-04T09:54:17.3799343Z 2025-12-04T09:54:17.3799389Z runner = Runner(partitions=[]) 2025-12-04T09:54:17.3799430Z call = runner.call 2025-12-04T09:54:17.3799503Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:54:17.3799506Z 2025-12-04T09:54:17.3799507Z 2025-12-04T09:54:17.3799571Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:54:17.3799628Z from torch._dynamo.testing import rand_strided 2025-12-04T09:54:17.3799695Z from torch._inductor.utils import print_performance 2025-12-04T09:54:17.3799774Z arg0_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:54:17.3799852Z arg1_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.3799898Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:54:17.3799972Z return 
print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:54:17.3799974Z 2025-12-04T09:54:17.3799976Z 2025-12-04T09:54:17.3800024Z if __name__ == "__main__": 2025-12-04T09:54:17.3800107Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:54:17.3800191Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:54:17.3800230Z From CHECK: .to( 2025-12-04T09:54:17.3800232Z 2025-12-04T09:54:17.3800234Z 2025-12-04T09:54:17.3800311Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:17.3800443Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm 2025-12-04T09:54:17.3800445Z 2025-12-04T09:54:17.3800536Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:17.3800618Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3800667Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3800726Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3800836Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3801198Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3801241Z graph_break [] 2025-12-04T09:54:17.3801286Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3801366Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3801736Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. 
Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T09:54:17.3801788Z warnings.warn( 2025-12-04T09:54:17.3802274Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:54:17.3802325Z current_size = base.storage().size() 2025-12-04T09:54:17.3802369Z Autotune Choices Stats: 2025-12-04T09:54:17.3802748Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_5", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.011599999852478504, "best_triton_pos": 0} 2025-12-04T09:54:17.3802819Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3802859Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3802913Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3803157Z triton_mm_5 0.0116 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3803400Z triton_mm_4 0.0136 ms 85.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3803633Z triton_mm_3 0.0216 ms 53.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.3803869Z triton_mm_2 0.0237 ms 48.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3804100Z triton_mm_1 0.0254 ms 45.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3804343Z triton_mm_0 0.0262 ms 44.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3804478Z SingleProcess AUTOTUNE benchmarking takes 0.0640 seconds and 0.2316 seconds precompiling for 6 choices 2025-12-04T09:54:17.3804520Z Autotune Choices Stats: 2025-12-04T09:54:17.3804893Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_6", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3804934Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3804975Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3805026Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3805264Z triton_mm_6 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3805502Z triton_mm_11 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3805750Z triton_mm_9 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3806012Z triton_mm_10 0.0065 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3806243Z triton_mm_8 0.0109 ms 53.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3806476Z triton_mm_7 0.0143 ms 40.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3806633Z SingleProcess AUTOTUNE benchmarking takes 0.0530 seconds and 0.1074 seconds precompiling for 6 choices 2025-12-04T09:54:17.3806677Z Autotune Choices Stats: 2025-12-04T09:54:17.3807043Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_17", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.3807086Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3807124Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3807177Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3807416Z triton_mm_17 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3807655Z triton_mm_16 0.0059 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3807891Z triton_mm_12 0.0091 ms 62.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3808136Z triton_mm_15 0.0110 ms 51.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3808370Z triton_mm_14 0.0119 ms 47.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3808603Z triton_mm_13 0.0147 ms 38.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3808733Z SingleProcess AUTOTUNE benchmarking takes 0.0597 seconds and 0.1288 seconds precompiling for 6 choices 2025-12-04T09:54:17.3808814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3808858Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3808918Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3809023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3809377Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3809431Z graph_break [] 2025-12-04T09:54:17.3809480Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3809556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3809597Z Autotune Choices Stats: 2025-12-04T09:54:17.3809968Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_270", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006120000034570694, "best_triton_pos": 0} 2025-12-04T09:54:17.3810009Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3810047Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3810095Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3810362Z triton_mm_270 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3810598Z triton_mm_271 0.0064 ms 95.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3810831Z triton_mm_272 0.0067 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3811066Z triton_mm_274 0.0067 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3811301Z triton_mm_273 0.0068 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3811534Z triton_mm_275 0.0068 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3811679Z SingleProcess AUTOTUNE benchmarking takes 0.0400 seconds and 0.0709 seconds precompiling for 6 choices 2025-12-04T09:54:17.3811720Z Autotune Choices Stats: 2025-12-04T09:54:17.3812088Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_276", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006200000178068876, "best_triton_pos": 0} 2025-12-04T09:54:17.3812128Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3812168Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3812218Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3812456Z triton_mm_276 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3812692Z triton_mm_278 0.0067 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3812924Z triton_mm_280 0.0067 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3813159Z triton_mm_281 0.0067 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3813400Z triton_mm_279 0.0068 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3813637Z triton_mm_277 0.0068 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3813763Z SingleProcess AUTOTUNE benchmarking takes 0.0409 seconds and 0.0696 seconds precompiling for 6 choices 2025-12-04T09:54:17.3813806Z Autotune Choices Stats: 2025-12-04T09:54:17.3814189Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_287", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.0061599998734891415, "best_triton_pos": 0} 2025-12-04T09:54:17.3814230Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3814271Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3814318Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3814559Z triton_mm_287 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3814791Z triton_mm_284 0.0065 ms 94.5% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3815026Z triton_mm_285 0.0067 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3815258Z triton_mm_283 0.0068 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3815502Z triton_mm_286 0.0068 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3815739Z triton_mm_282 0.0069 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3815867Z SingleProcess AUTOTUNE benchmarking takes 0.0398 seconds and 0.0688 seconds precompiling for 6 choices 2025-12-04T09:54:17.3815984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3816027Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3816087Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3816189Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3816543Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3816581Z graph_break [] 2025-12-04T09:54:17.3816625Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3816700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3816743Z Autotune Choices Stats: 2025-12-04T09:54:17.3817124Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_299", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3817167Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3817205Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3817253Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3817489Z triton_mm_299 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3817729Z triton_mm_294 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3817991Z triton_mm_298 0.0066 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3818223Z triton_mm_297 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3818457Z triton_mm_296 0.0067 ms 86.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3818689Z triton_mm_295 0.0140 ms 41.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3818820Z SingleProcess AUTOTUNE benchmarking takes 0.0535 seconds and 0.0749 seconds precompiling for 6 choices 2025-12-04T09:54:17.3818861Z Autotune Choices Stats: 2025-12-04T09:54:17.3819230Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_305", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006719999946653843, "best_triton_pos": 0} 2025-12-04T09:54:17.3819291Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3819330Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3819378Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3819616Z triton_mm_305 0.0067 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3819853Z triton_mm_303 0.0068 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3820085Z triton_mm_302 0.0069 ms 97.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3820318Z 
triton_mm_304 0.0070 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3820548Z triton_mm_301 0.0130 ms 51.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3820796Z triton_mm_300 0.0158 ms 42.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3820925Z SingleProcess AUTOTUNE benchmarking takes 0.0549 seconds and 0.0876 seconds precompiling for 6 choices 2025-12-04T09:54:17.3820966Z Autotune Choices Stats: 2025-12-04T09:54:17.3821330Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_309", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.3821368Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3821408Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3821454Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3821710Z triton_mm_309 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3821946Z triton_mm_310 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3822180Z triton_mm_307 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3822411Z triton_mm_306 0.0062 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3822646Z triton_mm_311 0.0067 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3822892Z triton_mm_308 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3823016Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0861 seconds precompiling for 6 choices 2025-12-04T09:54:17.3823096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3823138Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3823196Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3823298Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3823653Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3823693Z graph_break [] 2025-12-04T09:54:17.3823737Z aten_mm_info [('aten.mm_8_8_8', 1)] 
2025-12-04T09:54:17.3823813Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3823855Z Autotune Choices Stats: 2025-12-04T09:54:17.3824222Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_318", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3824276Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3824316Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3824363Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3824599Z triton_mm_318 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3824835Z triton_mm_322 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3825069Z triton_mm_320 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3825331Z triton_mm_319 0.0066 ms 89.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3825564Z triton_mm_321 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.3825798Z triton_mm_323 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3825947Z SingleProcess AUTOTUNE benchmarking takes 0.0427 seconds and 0.0765 seconds precompiling for 6 choices 2025-12-04T09:54:17.3825990Z Autotune Choices Stats: 2025-12-04T09:54:17.3826362Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_329", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3826407Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3826461Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3826512Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3826751Z triton_mm_329 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3826988Z triton_mm_328 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3827223Z triton_mm_326 0.0065 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3827456Z triton_mm_325 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3827689Z triton_mm_327 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3827921Z triton_mm_324 0.0073 ms 80.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3828064Z SingleProcess AUTOTUNE benchmarking takes 0.0376 seconds and 0.0674 seconds precompiling for 6 choices 2025-12-04T09:54:17.3828105Z Autotune Choices Stats: 2025-12-04T09:54:17.3828473Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_330", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3828515Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3828552Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3828597Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3828832Z triton_mm_330 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3829090Z triton_mm_333 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3829326Z triton_mm_334 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3829558Z triton_mm_331 0.0067 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3829792Z triton_mm_332 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3830028Z triton_mm_335 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3830154Z SingleProcess AUTOTUNE benchmarking takes 0.0424 seconds and 0.0784 seconds precompiling for 6 choices 2025-12-04T09:54:17.3830244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3830286Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3830344Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3830445Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3830793Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3830832Z graph_break [] 2025-12-04T09:54:17.3830879Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3830953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3830996Z Autotune Choices 
Stats: 2025-12-04T09:54:17.3831365Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00583899999037385, "best_triton_pos": 0} 2025-12-04T09:54:17.3831407Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3831444Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3831490Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3831730Z triton_mm_343 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3831971Z triton_mm_342 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3832209Z triton_mm_346 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3832441Z triton_mm_345 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3832695Z triton_mm_344 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3832928Z triton_mm_347 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3833058Z SingleProcess AUTOTUNE benchmarking takes 0.0381 seconds and 0.0818 seconds precompiling for 6 choices 2025-12-04T09:54:17.3833102Z Autotune Choices Stats: 2025-12-04T09:54:17.3833466Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_353", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.3833507Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3833546Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3833598Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3833831Z triton_mm_353 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3834078Z triton_mm_351 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3834314Z triton_mm_352 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3834548Z triton_mm_350 0.0069 ms 86.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3834785Z triton_mm_348 0.0170 ms 35.1% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3835017Z triton_mm_349 0.0170 ms 35.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3835147Z SingleProcess AUTOTUNE benchmarking takes 0.0544 seconds and 0.0790 seconds precompiling for 6 choices 2025-12-04T09:54:17.3835187Z Autotune Choices Stats: 2025-12-04T09:54:17.3835557Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_357", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006039999891072512, "best_triton_pos": 0} 2025-12-04T09:54:17.3835608Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3835649Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3835695Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3835966Z triton_mm_357 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3836197Z triton_mm_355 0.0065 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3836463Z triton_mm_358 0.0066 ms 91.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3836697Z triton_mm_359 
0.0068 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3836928Z triton_mm_356 0.0068 ms 88.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3837159Z triton_mm_354 0.0173 ms 35.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3837284Z SingleProcess AUTOTUNE benchmarking takes 0.0551 seconds and 0.0823 seconds precompiling for 6 choices 2025-12-04T09:54:17.3837360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3837402Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3837463Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3837564Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3837929Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3837967Z graph_break [] 2025-12-04T09:54:17.3838012Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3838086Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3838128Z Autotune Choices Stats: 2025-12-04T09:54:17.3838500Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_370", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3838541Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3838580Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3838626Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3838865Z triton_mm_370 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3839097Z triton_mm_371 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3839359Z triton_mm_366 0.0063 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3839589Z triton_mm_367 0.0067 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3839824Z triton_mm_368 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3840059Z triton_mm_369 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3840205Z 
SingleProcess AUTOTUNE benchmarking takes 0.0368 seconds and 0.0730 seconds precompiling for 6 choices 2025-12-04T09:54:17.3840248Z Autotune Choices Stats: 2025-12-04T09:54:17.3840617Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_375", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919000133872032, "best_triton_pos": 0} 2025-12-04T09:54:17.3840660Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3840699Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3840749Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3840982Z triton_mm_375 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3841220Z triton_mm_377 0.0066 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3841452Z triton_mm_376 0.0067 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3841696Z triton_mm_372 0.0146 ms 40.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3841930Z triton_mm_374 0.0160 ms 37.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3842162Z triton_mm_373 0.0166 ms 35.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3842292Z SingleProcess AUTOTUNE benchmarking takes 0.0563 seconds and 0.0969 seconds precompiling for 6 choices 2025-12-04T09:54:17.3842332Z Autotune Choices Stats: 2025-12-04T09:54:17.3842701Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.3842740Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3842780Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3842835Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3843075Z triton_mm_383 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3843306Z triton_mm_380 0.0067 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3843541Z triton_mm_381 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3843775Z triton_mm_382 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3844027Z triton_mm_379 0.0152 ms 39.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3844260Z triton_mm_378 0.0168 ms 35.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3844386Z SingleProcess AUTOTUNE benchmarking takes 0.0550 seconds and 0.0840 seconds precompiling for 6 choices 2025-12-04T09:54:17.3844462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3844504Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3844563Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3844664Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3845020Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3845070Z graph_break [] 2025-12-04T09:54:17.3845113Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3845189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3845230Z Autotune Choices Stats: 2025-12-04T09:54:17.3845595Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_390", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1", "best_time": 0.005799000151455402, "best_triton_pos": 0} 2025-12-04T09:54:17.3845635Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3845678Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3845724Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3845993Z triton_mm_390 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3846224Z triton_mm_391 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3846460Z triton_mm_395 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3846716Z triton_mm_393 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3846949Z triton_mm_392 0.0064 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3847184Z triton_mm_394 0.0076 ms 75.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3847310Z SingleProcess AUTOTUNE benchmarking takes 0.0425 seconds and 0.0779 seconds precompiling for 6 choices 2025-12-04T09:54:17.3847354Z Autotune Choices Stats: 
2025-12-04T09:54:17.3847765Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_401", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006200000178068876, "best_triton_pos": 0} 2025-12-04T09:54:17.3847809Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3847848Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3847899Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3848136Z triton_mm_401 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3848372Z triton_mm_399 0.0064 ms 96.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3848609Z triton_mm_400 0.0067 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3848845Z triton_mm_398 0.0079 ms 78.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3849088Z triton_mm_397 0.0162 ms 38.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3849319Z triton_mm_396 0.0164 ms 37.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3849449Z SingleProcess AUTOTUNE benchmarking takes 0.0568 seconds and 0.0828 seconds precompiling for 6 choices 2025-12-04T09:54:17.3849491Z Autotune Choices Stats: 2025-12-04T09:54:17.3849855Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_407", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.3849897Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3849937Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3849982Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3850219Z triton_mm_407 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3850464Z triton_mm_402 0.0060 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3850697Z triton_mm_405 0.0060 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3850936Z triton_mm_406 0.0061 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3851165Z triton_mm_404 0.0065 ms 87.1% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3851417Z triton_mm_403 0.0087 ms 65.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3851541Z SingleProcess AUTOTUNE benchmarking takes 0.0412 seconds and 0.0844 seconds precompiling for 6 choices 2025-12-04T09:54:17.3851617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3851660Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3851717Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3851818Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3852170Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3852211Z graph_break [] 2025-12-04T09:54:17.3852255Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3852331Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3852371Z Autotune Choices Stats: 2025-12-04T09:54:17.3852757Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_417", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3852797Z AUTOTUNE mm(8x8, 8x8) 
2025-12-04T09:54:17.3852843Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3852889Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3855179Z triton_mm_417 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3855433Z triton_mm_416 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3855668Z triton_mm_418 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3855903Z triton_mm_419 0.0082 ms 70.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3856170Z triton_mm_415 0.0138 ms 41.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3856426Z triton_mm_414 0.0147 ms 39.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3856557Z SingleProcess AUTOTUNE benchmarking takes 0.0568 seconds and 0.0778 seconds precompiling for 6 choices 2025-12-04T09:54:17.3856599Z Autotune Choices Stats: 2025-12-04T09:54:17.3856964Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_425", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3857004Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3857043Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3857091Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3857357Z triton_mm_425 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3857594Z triton_mm_424 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3857828Z triton_mm_421 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3858059Z triton_mm_422 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3858291Z triton_mm_420 0.0063 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3858542Z triton_mm_423 0.0070 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.3858669Z SingleProcess AUTOTUNE benchmarking takes 0.0621 seconds and 0.0806 seconds precompiling for 6 choices 2025-12-04T09:54:17.3858710Z Autotune Choices Stats: 2025-12-04T09:54:17.3859079Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_429", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3859120Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3859158Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3859205Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3859442Z triton_mm_429 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3859677Z triton_mm_428 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3859911Z triton_mm_430 0.0060 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3862898Z triton_mm_431 0.0076 ms 77.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3863140Z triton_mm_426 0.0114 ms 51.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3863374Z triton_mm_427 0.0144 ms 40.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3863502Z SingleProcess AUTOTUNE benchmarking takes 0.0629 seconds and 0.0875 seconds precompiling for 6 choices 2025-12-04T09:54:17.3863599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3863647Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3863704Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3863829Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3864182Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3864220Z graph_break [] 2025-12-04T09:54:17.3864264Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3864340Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3864380Z Autotune Choices Stats: 2025-12-04T09:54:17.3864750Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_442", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3864804Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3864845Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3864890Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3865129Z 
triton_mm_442 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3865362Z triton_mm_441 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3865601Z triton_mm_443 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3865836Z triton_mm_440 0.0071 ms 81.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3866105Z triton_mm_438 0.0144 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3866335Z triton_mm_439 0.0155 ms 37.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3866482Z SingleProcess AUTOTUNE benchmarking takes 0.0566 seconds and 0.0839 seconds precompiling for 6 choices 2025-12-04T09:54:17.3866525Z Autotune Choices Stats: 2025-12-04T09:54:17.3866973Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_448", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.3867013Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3867051Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3867098Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3867334Z triton_mm_448 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3867583Z triton_mm_444 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3867819Z triton_mm_449 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3868052Z triton_mm_446 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3868285Z triton_mm_447 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3868519Z triton_mm_445 0.0069 ms 86.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3868661Z SingleProcess AUTOTUNE benchmarking takes 0.0427 seconds and 0.0880 seconds precompiling for 6 choices 2025-12-04T09:54:17.3868703Z Autotune 
Choices Stats: 2025-12-04T09:54:17.3869067Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_454", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006599999964237213, "best_triton_pos": 0} 2025-12-04T09:54:17.3869107Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3869144Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3869192Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3869427Z triton_mm_454 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3869664Z triton_mm_453 0.0067 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3869898Z triton_mm_455 0.0068 ms 96.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3870132Z triton_mm_452 0.0122 ms 54.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3870373Z triton_mm_451 0.0124 ms 53.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3870629Z triton_mm_450 0.0141 ms 46.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, 
BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3870757Z SingleProcess AUTOTUNE benchmarking takes 0.0605 seconds and 0.1108 seconds precompiling for 6 choices 2025-12-04T09:54:17.3870831Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3870874Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3870930Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3871033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3871403Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3871445Z graph_break [] 2025-12-04T09:54:17.3871489Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3871566Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3871607Z Autotune Choices Stats: 2025-12-04T09:54:17.3871975Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_466", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005998999811708927, "best_triton_pos": 0} 2025-12-04T09:54:17.3872014Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3872053Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3872097Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3872334Z triton_mm_466 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3872579Z triton_mm_467 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3872810Z triton_mm_465 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3873042Z triton_mm_464 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3873275Z triton_mm_462 0.0117 ms 51.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3873507Z triton_mm_463 0.0161 ms 37.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3873633Z SingleProcess AUTOTUNE benchmarking takes 0.0556 seconds and 0.0778 seconds precompiling for 6 choices 2025-12-04T09:54:17.3873673Z Autotune Choices Stats: 2025-12-04T09:54:17.3874043Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_472", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.3874093Z AUTOTUNE mm(8x2, 2x8) 
2025-12-04T09:54:17.3874133Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3874192Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3874430Z triton_mm_472 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3874664Z triton_mm_473 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3874912Z triton_mm_469 0.0061 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3875144Z triton_mm_468 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3875375Z triton_mm_470 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3875607Z triton_mm_471 0.0082 ms 72.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3875733Z SingleProcess AUTOTUNE benchmarking takes 0.0484 seconds and 0.0791 seconds precompiling for 6 choices 2025-12-04T09:54:17.3875775Z Autotune Choices Stats: 2025-12-04T09:54:17.3876174Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_478", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3876236Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3876273Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3876321Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3876554Z triton_mm_478 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3876789Z triton_mm_479 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3877021Z triton_mm_477 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3877258Z triton_mm_475 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3877492Z triton_mm_474 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3877734Z triton_mm_476 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 
2025-12-04T09:54:17.3877860Z SingleProcess AUTOTUNE benchmarking takes 0.0431 seconds and 0.0849 seconds precompiling for 6 choices 2025-12-04T09:54:17.3877954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3877997Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3878055Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3878156Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3878502Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3878541Z graph_break [] 2025-12-04T09:54:17.3878597Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3878671Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3878711Z Autotune Choices Stats: 2025-12-04T09:54:17.3879080Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_488", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.3879120Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3879158Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3879204Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3879438Z triton_mm_488 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3879673Z triton_mm_490 0.0060 ms 97.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3879922Z triton_mm_489 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3880155Z triton_mm_491 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3880386Z triton_mm_487 0.0130 ms 44.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3880621Z triton_mm_486 0.0147 ms 39.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3880748Z SingleProcess AUTOTUNE benchmarking takes 0.0483 seconds and 0.0832 seconds precompiling for 6 choices 2025-12-04T09:54:17.3880788Z Autotune Choices Stats: 2025-12-04T09:54:17.3881152Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_497", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.3881191Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3881238Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3881284Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3881521Z 
triton_mm_497 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3881764Z triton_mm_496 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3882000Z triton_mm_495 0.0065 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3882244Z triton_mm_494 0.0118 ms 50.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3882477Z triton_mm_493 0.0132 ms 45.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3882711Z triton_mm_492 0.0136 ms 43.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3882836Z SingleProcess AUTOTUNE benchmarking takes 0.0532 seconds and 0.0824 seconds precompiling for 6 choices 2025-12-04T09:54:17.3882876Z Autotune Choices Stats: 2025-12-04T09:54:17.3883239Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_502", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3883278Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3883315Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3883371Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3883606Z triton_mm_502 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3883838Z triton_mm_503 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3884077Z triton_mm_501 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3884309Z triton_mm_500 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3884541Z triton_mm_499 0.0145 ms 39.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3884770Z triton_mm_498 0.0162 ms 35.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3884895Z SingleProcess AUTOTUNE benchmarking takes 0.0525 seconds and 0.0826 seconds precompiling for 6 choices 2025-12-04T09:54:17.3884980Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3885023Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3885079Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3885197Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3885543Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3885581Z graph_break [] 2025-12-04T09:54:17.3885624Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3885698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3885738Z Autotune Choices Stats: 2025-12-04T09:54:17.3886161Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_514", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3886202Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3886239Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3886285Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3886521Z triton_mm_514 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3886757Z triton_mm_513 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.3886989Z triton_mm_515 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3887224Z triton_mm_512 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3887469Z triton_mm_511 0.0125 ms 47.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3887698Z triton_mm_510 0.0145 ms 40.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3887826Z SingleProcess AUTOTUNE benchmarking takes 0.0479 seconds and 0.0837 seconds precompiling for 6 choices 2025-12-04T09:54:17.3887865Z Autotune Choices Stats: 2025-12-04T09:54:17.3888234Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_518", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.3888273Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3888310Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3888356Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3888591Z triton_mm_518 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3888837Z triton_mm_521 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3889086Z triton_mm_516 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3889320Z triton_mm_519 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3889549Z triton_mm_520 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3889791Z triton_mm_517 0.0063 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3889917Z SingleProcess AUTOTUNE benchmarking takes 0.0385 seconds and 0.0777 seconds precompiling for 6 choices 2025-12-04T09:54:17.3889958Z Autotune Choices Stats: 2025-12-04T09:54:17.3890326Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_526", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3890366Z AUTOTUNE mm(8x5, 5x2) 
2025-12-04T09:54:17.3890403Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3890449Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3890684Z triton_mm_526 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3890916Z triton_mm_524 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3891158Z triton_mm_525 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3891388Z triton_mm_523 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3891621Z triton_mm_527 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3891853Z triton_mm_522 0.0149 ms 39.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3891979Z SingleProcess AUTOTUNE benchmarking takes 0.0475 seconds and 0.0823 seconds precompiling for 6 choices 2025-12-04T09:54:17.3892052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3892095Z frames [('total', 1), ('ok', 1)] 
2025-12-04T09:54:17.3892150Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3892252Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3892621Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3892671Z graph_break [] 2025-12-04T09:54:17.3892715Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3892787Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3892828Z Autotune Choices Stats: 2025-12-04T09:54:17.3893195Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_538", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.3893235Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3893281Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3893327Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3893565Z triton_mm_538 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3893800Z triton_mm_535 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3894031Z triton_mm_536 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, 
BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3894266Z triton_mm_539 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3894501Z triton_mm_537 0.0060 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3894744Z triton_mm_534 0.0105 ms 54.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3894870Z SingleProcess AUTOTUNE benchmarking takes 0.0474 seconds and 0.0745 seconds precompiling for 6 choices 2025-12-04T09:54:17.3894909Z Autotune Choices Stats: 2025-12-04T09:54:17.3895277Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_542", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3895316Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3895355Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3895402Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3895639Z triton_mm_542 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3895870Z triton_mm_545 0.0063 ms 91.8% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3896135Z triton_mm_543 0.0064 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3896404Z triton_mm_544 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3896636Z triton_mm_541 0.0117 ms 49.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3896868Z triton_mm_540 0.0164 ms 35.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3896994Z SingleProcess AUTOTUNE benchmarking takes 0.0479 seconds and 0.0806 seconds precompiling for 6 choices 2025-12-04T09:54:17.3897048Z Autotune Choices Stats: 2025-12-04T09:54:17.3897414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_548", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.3897454Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3897490Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3897535Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3897770Z triton_mm_548 
0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3898001Z triton_mm_549 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3898235Z triton_mm_551 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3898481Z triton_mm_546 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3898714Z triton_mm_550 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3898946Z triton_mm_547 0.0068 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3899073Z SingleProcess AUTOTUNE benchmarking takes 0.0381 seconds and 0.0785 seconds precompiling for 6 choices 2025-12-04T09:54:17.3899150Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3899192Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3899249Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3899350Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3899696Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3899750Z graph_break [] 2025-12-04T09:54:17.3899795Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3899868Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3899909Z Autotune Choices Stats: 2025-12-04T09:54:17.3900287Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_560", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.3900327Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3900363Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3900409Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3900646Z triton_mm_560 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3900887Z triton_mm_558 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3901119Z triton_mm_562 0.0065 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3901350Z 
triton_mm_559 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3901581Z triton_mm_563 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3901811Z triton_mm_561 0.0071 ms 84.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3901952Z SingleProcess AUTOTUNE benchmarking takes 0.0437 seconds and 0.0912 seconds precompiling for 6 choices 2025-12-04T09:54:17.3901991Z Autotune Choices Stats: 2025-12-04T09:54:17.3902357Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_567", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.3902395Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3902434Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3902480Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3902717Z triton_mm_567 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3902950Z triton_mm_569 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3903181Z triton_mm_568 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3903412Z triton_mm_566 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3903654Z triton_mm_565 0.0164 ms 36.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3903898Z triton_mm_564 0.0167 ms 35.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3904023Z SingleProcess AUTOTUNE benchmarking takes 0.0543 seconds and 0.0816 seconds precompiling for 6 choices 2025-12-04T09:54:17.3904064Z Autotune Choices Stats: 2025-12-04T09:54:17.3904444Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_573", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3904484Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3904523Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3904567Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3904804Z triton_mm_573 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3905035Z triton_mm_571 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3905267Z triton_mm_570 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3905498Z triton_mm_572 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3905743Z triton_mm_574 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3906000Z triton_mm_575 0.0067 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3906125Z SingleProcess AUTOTUNE benchmarking takes 0.1428 seconds and 0.0823 seconds precompiling for 6 choices 2025-12-04T09:54:17.3906201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3906242Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3906300Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3906401Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3906760Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), 
('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3906796Z graph_break [] 2025-12-04T09:54:17.3906840Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3906913Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3906953Z Autotune Choices Stats: 2025-12-04T09:54:17.3907330Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_584", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.3907387Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3907425Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3907471Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3907705Z triton_mm_584 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3907937Z triton_mm_585 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3908184Z triton_mm_586 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3908414Z triton_mm_587 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3908650Z triton_mm_582 0.0150 ms 40.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3908883Z triton_mm_583 0.0162 ms 36.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3909011Z SingleProcess AUTOTUNE benchmarking takes 0.0565 seconds and 0.0808 seconds precompiling for 6 choices 2025-12-04T09:54:17.3909051Z Autotune Choices Stats: 2025-12-04T09:54:17.3909417Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_591", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006120000034570694, "best_triton_pos": 0} 2025-12-04T09:54:17.3909470Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3909509Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3909555Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3909791Z triton_mm_591 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3910025Z triton_mm_590 0.0064 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3910256Z triton_mm_593 0.0067 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3910490Z triton_mm_592 0.0068 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3910724Z triton_mm_588 0.0152 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3910972Z triton_mm_589 0.0170 ms 35.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3911113Z SingleProcess AUTOTUNE benchmarking takes 0.0561 seconds and 0.0832 seconds precompiling for 6 choices 2025-12-04T09:54:17.3911155Z Autotune Choices Stats: 2025-12-04T09:54:17.3911518Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_596", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.3911560Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3911596Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3911642Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3911891Z triton_mm_596 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3912126Z triton_mm_599 0.0062 ms 96.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3912362Z triton_mm_594 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3912597Z triton_mm_597 0.0065 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3912834Z triton_mm_598 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3913085Z triton_mm_595 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3913211Z SingleProcess AUTOTUNE benchmarking takes 0.0422 seconds and 0.0839 seconds precompiling for 6 choices 2025-12-04T09:54:17.3913286Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3913329Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3913385Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3913488Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3913838Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), 
('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3913877Z graph_break [] 2025-12-04T09:54:17.3913921Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3913995Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3914036Z Autotune Choices Stats: 2025-12-04T09:54:17.3914401Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_609", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3914452Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3914489Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3914538Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3914789Z triton_mm_609 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3915028Z triton_mm_611 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3915260Z triton_mm_610 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3915505Z triton_mm_608 0.0074 ms 77.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3915740Z triton_mm_607 0.0108 ms 53.1% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3916009Z triton_mm_606 0.0126 ms 45.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3916137Z SingleProcess AUTOTUNE benchmarking takes 0.0570 seconds and 0.0813 seconds precompiling for 6 choices 2025-12-04T09:54:17.3916176Z Autotune Choices Stats: 2025-12-04T09:54:17.3916544Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_616", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3916583Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3916638Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3916686Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3916924Z triton_mm_616 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3917156Z triton_mm_614 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3917389Z triton_mm_615 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.3917620Z triton_mm_617 0.0075 ms 76.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3917851Z triton_mm_612 0.0146 ms 39.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3918082Z triton_mm_613 0.0148 ms 38.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3918222Z SingleProcess AUTOTUNE benchmarking takes 0.0564 seconds and 0.0794 seconds precompiling for 6 choices 2025-12-04T09:54:17.3918264Z Autotune Choices Stats: 2025-12-04T09:54:17.3918645Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_621", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.3918685Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3918723Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3918767Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3919006Z triton_mm_621 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3919252Z triton_mm_620 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3919488Z triton_mm_622 0.0076 ms 77.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3919721Z triton_mm_623 0.0080 ms 72.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3919954Z triton_mm_618 0.0156 ms 37.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3920185Z triton_mm_619 0.0163 ms 35.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3920312Z SingleProcess AUTOTUNE benchmarking takes 0.0544 seconds and 0.0817 seconds precompiling for 6 choices 2025-12-04T09:54:17.3920398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3920440Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3920499Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3920600Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3920954Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3920991Z graph_break [] 2025-12-04T09:54:17.3921036Z aten_mm_info 
[('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3921108Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3921150Z Autotune Choices Stats: 2025-12-04T09:54:17.3921516Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_635", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.3921557Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3921593Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3921639Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3921874Z triton_mm_635 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3922119Z triton_mm_633 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3922365Z triton_mm_634 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3922599Z triton_mm_632 0.0072 ms 82.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3922835Z triton_mm_630 0.0108 ms 55.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3923081Z triton_mm_631 0.0139 ms 43.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3923212Z SingleProcess AUTOTUNE benchmarking takes 0.0528 seconds and 0.0731 seconds precompiling for 6 choices 2025-12-04T09:54:17.3923251Z Autotune Choices Stats: 2025-12-04T09:54:17.3923615Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_641", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3923654Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3923693Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3923741Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3923979Z triton_mm_641 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3924224Z triton_mm_638 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3924456Z triton_mm_636 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3924690Z triton_mm_639 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3924926Z triton_mm_640 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3925163Z triton_mm_637 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3925288Z SingleProcess AUTOTUNE benchmarking takes 0.0424 seconds and 0.0812 seconds precompiling for 6 choices 2025-12-04T09:54:17.3925330Z Autotune Choices Stats: 2025-12-04T09:54:17.3925697Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_642", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3925754Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3925793Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3925837Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3926136Z triton_mm_642 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3926370Z triton_mm_646 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3926627Z triton_mm_645 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, 
BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3926862Z triton_mm_644 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3927095Z triton_mm_647 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3927327Z triton_mm_643 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3927451Z SingleProcess AUTOTUNE benchmarking takes 0.0410 seconds and 0.0762 seconds precompiling for 6 choices 2025-12-04T09:54:17.3927526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3927569Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3927627Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3927726Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3928091Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3928127Z graph_break [] 2025-12-04T09:54:17.3928171Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3928245Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:54:17.3928286Z Autotune Choices Stats: 2025-12-04T09:54:17.3928652Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_656", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006639000028371811, "best_triton_pos": 0} 2025-12-04T09:54:17.3928691Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3928728Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3928776Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3929013Z triton_mm_656 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3929248Z triton_mm_657 0.0067 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3929502Z triton_mm_658 0.0067 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3929743Z triton_mm_659 0.0067 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3929976Z triton_mm_655 0.0121 ms 54.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3930206Z triton_mm_654 0.0150 ms 44.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3930346Z SingleProcess AUTOTUNE benchmarking takes 0.0571 seconds and 0.0857 seconds precompiling for 6 choices 2025-12-04T09:54:17.3930385Z Autotune Choices Stats: 2025-12-04T09:54:17.3930754Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_664", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00595899997279048, "best_triton_pos": 0} 2025-12-04T09:54:17.3930795Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3930833Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3930880Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3931115Z triton_mm_664 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3931350Z triton_mm_665 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3931581Z triton_mm_661 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3931824Z triton_mm_663 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3932056Z 
triton_mm_662 0.0069 ms 86.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3932291Z triton_mm_660 0.0071 ms 83.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3932418Z SingleProcess AUTOTUNE benchmarking takes 0.0416 seconds and 0.0790 seconds precompiling for 6 choices 2025-12-04T09:54:17.3932458Z Autotune Choices Stats: 2025-12-04T09:54:17.3932829Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_670", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00583899999037385, "best_triton_pos": 0} 2025-12-04T09:54:17.3932866Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3932905Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3932950Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3933197Z triton_mm_670 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3933438Z triton_mm_669 0.0062 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3933671Z triton_mm_666 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3933899Z triton_mm_668 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3934141Z triton_mm_671 0.0069 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3934379Z triton_mm_667 0.0071 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3934507Z SingleProcess AUTOTUNE benchmarking takes 0.0445 seconds and 0.0767 seconds precompiling for 6 choices 2025-12-04T09:54:17.3934583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3934626Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3934686Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3934785Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3935140Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3935187Z graph_break [] 2025-12-04T09:54:17.3935231Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3935304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3935345Z Autotune Choices Stats: 2025-12-04T09:54:17.3935709Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_681", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3935750Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3935787Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3935833Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3936097Z triton_mm_681 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3936332Z triton_mm_680 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3936567Z triton_mm_683 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3936799Z triton_mm_679 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3937062Z triton_mm_678 0.0062 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3937291Z triton_mm_682 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3937418Z SingleProcess AUTOTUNE benchmarking takes 0.0406 seconds and 0.0737 seconds precompiling for 6 choices 2025-12-04T09:54:17.3937457Z Autotune Choices Stats: 2025-12-04T09:54:17.3937834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_685", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00687999976798892, "best_triton_pos": 0} 2025-12-04T09:54:17.3937877Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3937916Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3937964Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3938197Z triton_mm_685 0.0069 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3938428Z triton_mm_684 0.0074 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3938661Z triton_mm_686 0.0087 ms 78.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3938895Z triton_mm_689 0.0088 ms 78.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3939142Z triton_mm_688 0.0091 ms 75.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3939372Z triton_mm_687 0.0097 ms 70.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3939496Z SingleProcess AUTOTUNE benchmarking takes 0.0512 seconds and 0.0758 seconds precompiling for 6 choices 2025-12-04T09:54:17.3939538Z Autotune Choices Stats: 2025-12-04T09:54:17.3939903Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_691", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.007079999893903732, "best_triton_pos": 0} 2025-12-04T09:54:17.3939943Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3939980Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3940025Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3940260Z triton_mm_691 0.0071 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3940496Z triton_mm_694 0.0071 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3940759Z triton_mm_695 0.0071 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3940990Z triton_mm_692 0.0072 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3941221Z triton_mm_693 0.0072 ms 97.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3941465Z triton_mm_690 0.0074 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3941590Z SingleProcess AUTOTUNE benchmarking takes 0.0479 seconds and 0.0859 seconds precompiling for 6 choices 2025-12-04T09:54:17.3941665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3941707Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3941764Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3941864Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3942214Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3942252Z graph_break [] 2025-12-04T09:54:17.3942299Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3942372Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3942412Z Autotune Choices Stats: 2025-12-04T09:54:17.3942787Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_702", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3942839Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3942876Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3942921Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3943158Z triton_mm_702 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3943391Z triton_mm_704 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3943629Z triton_mm_705 0.0067 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3943861Z triton_mm_703 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3944095Z triton_mm_707 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3944339Z triton_mm_706 0.0069 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3944477Z SingleProcess AUTOTUNE benchmarking takes 0.0440 seconds and 0.0776 seconds precompiling for 6 choices 
2025-12-04T09:54:17.3944518Z Autotune Choices Stats: 2025-12-04T09:54:17.3944884Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_713", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006118999794125557, "best_triton_pos": 0} 2025-12-04T09:54:17.3944924Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3944962Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3945011Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3945255Z triton_mm_713 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3945491Z triton_mm_712 0.0068 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3945723Z triton_mm_711 0.0107 ms 57.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3945993Z triton_mm_710 0.0146 ms 42.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3946230Z triton_mm_709 0.0165 ms 37.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3946477Z triton_mm_708 0.0167 ms 36.6% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3946604Z SingleProcess AUTOTUNE benchmarking takes 0.0660 seconds and 0.0859 seconds precompiling for 6 choices 2025-12-04T09:54:17.3946644Z Autotune Choices Stats: 2025-12-04T09:54:17.3947013Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_714", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006200000178068876, "best_triton_pos": 0} 2025-12-04T09:54:17.3947051Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3947090Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3947134Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3947371Z triton_mm_714 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3947600Z triton_mm_715 0.0063 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3947835Z triton_mm_719 0.0066 ms 94.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3948079Z triton_mm_716 0.0068 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3948330Z triton_mm_718 
0.0068 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3948560Z triton_mm_717 0.0069 ms 89.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3948686Z SingleProcess AUTOTUNE benchmarking takes 0.0450 seconds and 0.0836 seconds precompiling for 6 choices 2025-12-04T09:54:17.3948763Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3948816Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3948875Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3948977Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3949330Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3949366Z graph_break [] 2025-12-04T09:54:17.3949409Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3949483Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3949525Z Autotune Choices Stats: 2025-12-04T09:54:17.3949892Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_731", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.3949941Z 
AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3949981Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3950025Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3950263Z triton_mm_731 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3950494Z triton_mm_730 0.0061 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3950733Z triton_mm_728 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3950966Z triton_mm_729 0.0070 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3951199Z triton_mm_726 0.0162 ms 37.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3951435Z triton_mm_727 0.0164 ms 36.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3951573Z SingleProcess AUTOTUNE benchmarking takes 0.0575 seconds and 0.0835 seconds precompiling for 6 choices 2025-12-04T09:54:17.3951617Z Autotune Choices Stats: 2025-12-04T09:54:17.3951993Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_736", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.3952033Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3952070Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3952117Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3952351Z triton_mm_736 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3952597Z triton_mm_737 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3952831Z triton_mm_732 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3953068Z triton_mm_735 0.0067 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3953300Z triton_mm_733 0.0069 ms 86.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3953535Z triton_mm_734 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3953662Z SingleProcess AUTOTUNE benchmarking takes 0.0434 seconds and 0.0660 seconds precompiling for 6 choices 2025-12-04T09:54:17.3953714Z Autotune Choices Stats: 2025-12-04T09:54:17.3954078Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_740", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.3954116Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3954154Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3954200Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3954442Z triton_mm_740 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3954675Z triton_mm_742 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3954913Z triton_mm_743 0.0065 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3955151Z triton_mm_741 0.0066 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3955398Z triton_mm_739 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3955640Z triton_mm_738 0.0078 ms 75.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3955767Z SingleProcess AUTOTUNE benchmarking takes 0.0425 seconds and 0.0590 seconds precompiling for 6 choices 2025-12-04T09:54:17.3955842Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3955885Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3955976Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3956077Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3956438Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3956479Z graph_break [] 2025-12-04T09:54:17.3956523Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3956596Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3956636Z Autotune Choices Stats: 2025-12-04T09:54:17.3957002Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_755", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006639999803155661, "best_triton_pos": 0} 2025-12-04T09:54:17.3957040Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3957077Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3957123Z dtypes: torch.float32, torch.float32 
2025-12-04T09:54:17.3957367Z triton_mm_755 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3957611Z triton_mm_753 0.0067 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3957844Z triton_mm_752 0.0067 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3958078Z triton_mm_754 0.0069 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3958312Z triton_mm_750 0.0126 ms 52.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3958545Z triton_mm_751 0.0172 ms 38.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3958670Z SingleProcess AUTOTUNE benchmarking takes 0.0560 seconds and 0.0783 seconds precompiling for 6 choices 2025-12-04T09:54:17.3958712Z Autotune Choices Stats: 2025-12-04T09:54:17.3959081Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_756", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006320000160485506, "best_triton_pos": 0} 2025-12-04T09:54:17.3959135Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3959174Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3959236Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3959471Z triton_mm_756 0.0063 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3959703Z triton_mm_761 0.0064 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3959944Z triton_mm_757 0.0066 ms 95.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3960180Z triton_mm_760 0.0067 ms 94.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3960412Z triton_mm_759 0.0067 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3960643Z triton_mm_758 0.0068 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3960770Z SingleProcess AUTOTUNE benchmarking takes 0.0428 seconds and 0.0742 seconds precompiling for 6 choices 
2025-12-04T09:54:17.3960812Z Autotune Choices Stats: 2025-12-04T09:54:17.3961185Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_766", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006719999946653843, "best_triton_pos": 0} 2025-12-04T09:54:17.3961237Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3961276Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3961320Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3961556Z triton_mm_766 0.0067 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3961792Z triton_mm_767 0.0067 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3962029Z triton_mm_765 0.0068 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3962262Z triton_mm_764 0.0070 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3962494Z triton_mm_762 0.0167 ms 40.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3962727Z triton_mm_763 0.0171 ms 39.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3962861Z SingleProcess AUTOTUNE benchmarking takes 0.0577 seconds and 0.0862 seconds precompiling for 6 choices 2025-12-04T09:54:17.3962947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3962991Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3963049Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3963149Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3963496Z inductor [('triton_bundler_save_kernel', 48), ('generated_module_cache_hit', 5), ('benchmarking.InductorBenchmarker.benchmark', 5), ('benchmarking.InductorBenchmarker.benchmark_gpu', 5), ('async_compile_cache_miss', 3), ('fxgraph_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3963534Z graph_break [] 2025-12-04T09:54:17.3963587Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3963662Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3963702Z Autotune Choices Stats: 2025-12-04T09:54:17.3964072Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_777", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.0061599998734891415, "best_triton_pos": 0} 2025-12-04T09:54:17.3964112Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3964151Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3964197Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3964433Z triton_mm_777 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3964667Z triton_mm_779 0.0067 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3964917Z triton_mm_778 0.0068 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3965156Z triton_mm_776 0.0072 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3965392Z triton_mm_774 0.0119 ms 51.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3965627Z triton_mm_775 0.0119 ms 51.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3965756Z SingleProcess AUTOTUNE benchmarking takes 0.0665 seconds and 0.0842 seconds precompiling for 6 choices 2025-12-04T09:54:17.3965800Z Autotune Choices Stats: 2025-12-04T09:54:17.3966197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_783", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006639000028371811, "best_triton_pos": 0} 2025-12-04T09:54:17.3966238Z 
AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3966292Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3966342Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3966577Z triton_mm_783 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3966827Z triton_mm_782 0.0068 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3967061Z triton_mm_784 0.0068 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3967296Z triton_mm_785 0.0068 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3967543Z triton_mm_781 0.0072 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3967776Z triton_mm_780 0.0073 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3967904Z SingleProcess AUTOTUNE benchmarking takes 0.0427 seconds and 0.0786 seconds precompiling for 6 choices 2025-12-04T09:54:17.3967944Z Autotune Choices Stats: 2025-12-04T09:54:17.3968310Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_786", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006120000034570694, "best_triton_pos": 0} 2025-12-04T09:54:17.3968351Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3968389Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3968435Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3968690Z triton_mm_786 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3968924Z triton_mm_789 0.0066 ms 93.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3969154Z triton_mm_791 0.0067 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3969389Z triton_mm_788 0.0068 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3969624Z triton_mm_790 0.0068 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3969858Z triton_mm_787 0.0075 ms 81.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.3969984Z SingleProcess AUTOTUNE benchmarking takes 0.0473 seconds and 0.0571 seconds precompiling for 6 choices 2025-12-04T09:54:17.3970067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3970112Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3970169Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T09:54:17.3970271Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3970797Z inductor [('triton_bundler_save_kernel', 128), ('benchmarking.InductorBenchmarker.benchmark_gpu', 14), ('async_compile_cache_miss', 12), ('benchmarking.InductorBenchmarker.benchmark', 8), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3970837Z graph_break [] 2025-12-04T09:54:17.3970881Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.3970955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3970996Z Autotune Choices Stats: 2025-12-04T09:54:17.3971381Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_802", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.3971421Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3971460Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3971506Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3971748Z triton_mm_802 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3971983Z triton_mm_800 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3972217Z triton_mm_803 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3972461Z triton_mm_801 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3972692Z triton_mm_798 0.0113 ms 50.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3972927Z triton_mm_799 0.0142 ms 40.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3973054Z SingleProcess AUTOTUNE benchmarking takes 0.0546 seconds and 0.0824 seconds precompiling for 6 choices 2025-12-04T09:54:17.3973142Z __________________ TestPatternMatcher.test_mixed_mm_epi_works __________________ 2025-12-04T09:54:17.3973190Z Traceback (most recent call last): 2025-12-04T09:54:17.3973342Z File "/var/lib/jenkins/pytorch/test/inductor/test_pattern_matcher.py", line 463, in test_mixed_mm_epi_works 2025-12-04T09:54:17.3973396Z self._test_mixed_impl(fn, args, True, False) 2025-12-04T09:54:17.3973530Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:54:17.3973609Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:54:17.3973687Z RuntimeError: Expected to find "tl.dot" but did not find it 2025-12-04T09:54:17.3973734Z Searched string: 2025-12-04T09:54:17.3973780Z # inductor generates a suffix 2025-12-04T09:54:17.3973823Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.3973959Z tmp0 = tl.load(in_ptr2 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last').to(tl.float32) 2025-12-04T09:54:17.3974096Z tmp2 = tl.load(in_ptr3 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last').to(tl.float32) 2025-12-04T09:54:17.3974139Z tmp1 = acc * tmp0 2025-12-04T09:54:17.3974178Z tmp3 = tmp1 + tmp2 2025-12-04T09:54:17.3974281Z tl.store(out_ptr1 + (tl.broadcast_to(idx_n + 8*idx_m, [BLOCK_M, BLOCK_N])), tmp3, mask) 2025-12-04T09:54:17.3974320Z ''', device_str='cuda') 2025-12-04T09:54:17.3974325Z 2025-12-04T09:54:17.3974328Z 2025-12-04T09:54:17.3974371Z async_compile.wait(globals()) 2025-12-04T09:54:17.3974412Z del async_compile 2025-12-04T09:54:17.3974414Z 2025-12-04T09:54:17.3974450Z class Runner: 2025-12-04T09:54:17.3974501Z def __init__(self, partitions): 2025-12-04T09:54:17.3974547Z self.partitions = partitions 2025-12-04T09:54:17.3974549Z 2025-12-04T09:54:17.3974607Z def recursively_apply_fns(self, fns): 2025-12-04T09:54:17.3974650Z new_callables = [] 2025-12-04T09:54:17.3974706Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:54:17.3974755Z new_callables.append(fn(c)) 2025-12-04T09:54:17.3974806Z self.partitions = new_callables 2025-12-04T09:54:17.3974808Z 2025-12-04T09:54:17.3974849Z def call(self, args): 2025-12-04T09:54:17.3974897Z arg0_1, arg1_1, arg2_1, arg3_1 = args 2025-12-04T09:54:17.3974935Z args.clear() 2025-12-04T09:54:17.3974989Z assert_size_stride(arg0_1, (2, 8), (8, 1)) 2025-12-04T09:54:17.3975037Z 
assert_size_stride(arg1_1, (8, 2), (2, 1)) 2025-12-04T09:54:17.3975086Z assert_size_stride(arg2_1, (8, ), (1, )) 2025-12-04T09:54:17.3975134Z assert_size_stride(arg3_1, (8, ), (1, )) 2025-12-04T09:54:17.3975183Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:54:17.3975227Z torch.cuda.set_device(0) 2025-12-04T09:54:17.3975299Z buf0 = empty_strided_cuda((2, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.3975390Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:54:17.3975447Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.3975526Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 16, stream=stream0) 2025-12-04T09:54:17.3975565Z del arg0_1 2025-12-04T09:54:17.3975630Z buf1 = empty_strided_cuda((8, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.3975711Z # Topologically Sorted Source Nodes: [mm], Original ATen: [aten.mm] 2025-12-04T09:54:17.3975754Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.3975825Z triton_poi_fused_mm_1.run(arg1_1, buf1, 64, stream=stream0) 2025-12-04T09:54:17.3975862Z del arg1_1 2025-12-04T09:54:17.3975956Z buf2 = empty_strided_cuda((8, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.3976061Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:54:17.3976106Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.3976181Z triton_poi_fused__to_copy_mm_2.run(buf0, buf2, 64, stream=stream0) 2025-12-04T09:54:17.3976219Z del buf0 2025-12-04T09:54:17.3976280Z buf4 = empty_strided_cuda((8, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.3976416Z # Topologically Sorted Source Nodes: [mm, to, mul, add], Original ATen: [aten.mm, aten._to_copy, aten.mul, aten.add] 2025-12-04T09:54:17.3976458Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.3976572Z triton_tem_fused__to_copy_add_mm_mul_3.run(buf1, buf2, arg2_1, arg3_1, buf4, 1, 1, 1, stream=stream0) 2025-12-04T09:54:17.3976610Z del arg2_1 2025-12-04T09:54:17.3976648Z del arg3_1 2025-12-04T09:54:17.3976684Z del buf1 
2025-12-04T09:54:17.3976736Z del buf2 2025-12-04T09:54:17.3976775Z return (buf4, ) 2025-12-04T09:54:17.3976778Z 2025-12-04T09:54:17.3976829Z runner = Runner(partitions=[]) 2025-12-04T09:54:17.3976867Z call = runner.call 2025-12-04T09:54:17.3976940Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:54:17.3976943Z 2025-12-04T09:54:17.3976945Z 2025-12-04T09:54:17.3977021Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:54:17.3977085Z from torch._dynamo.testing import rand_strided 2025-12-04T09:54:17.3977149Z from torch._inductor.utils import print_performance 2025-12-04T09:54:17.3977229Z arg0_1 = rand_strided((2, 8), (8, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:54:17.3977305Z arg1_1 = rand_strided((8, 2), (2, 1), device='cuda:0', dtype=torch.bfloat16) 2025-12-04T09:54:17.3977382Z arg2_1 = rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.bfloat16) 2025-12-04T09:54:17.3977455Z arg3_1 = rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.bfloat16) 2025-12-04T09:54:17.3977528Z fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1]) 2025-12-04T09:54:17.3977600Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:54:17.3977602Z 2025-12-04T09:54:17.3977604Z 2025-12-04T09:54:17.3977646Z if __name__ == "__main__": 2025-12-04T09:54:17.3977732Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:54:17.3977803Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:54:17.3977842Z From CHECK: tl.dot 2025-12-04T09:54:17.3977844Z 2025-12-04T09:54:17.3977846Z 2025-12-04T09:54:17.3977925Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:17.3978070Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_epi_works 2025-12-04T09:54:17.3978072Z 2025-12-04T09:54:17.3978162Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:17.3978237Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.3978283Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3978342Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.3978447Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3978966Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3979004Z graph_break [] 2025-12-04T09:54:17.3979051Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.3979125Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3979169Z Autotune Choices Stats: 2025-12-04T09:54:17.3979545Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_36", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3979589Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3979627Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3979675Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3979917Z triton_mm_36 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3980151Z triton_mm_40 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3980392Z triton_mm_38 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3980637Z triton_mm_39 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3980868Z triton_mm_41 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3981115Z triton_mm_37 0.0069 ms 84.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3981247Z SingleProcess AUTOTUNE benchmarking takes 0.5618 seconds and 0.0833 seconds precompiling for 6 choices 2025-12-04T09:54:17.3981288Z Autotune Choices Stats: 2025-12-04T09:54:17.3981655Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_47", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.3981694Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3981733Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3981781Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3982017Z triton_mm_47 0.0059 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3982249Z triton_mm_46 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3982490Z triton_mm_43 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3982724Z triton_mm_44 0.0067 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3982956Z triton_mm_45 0.0068 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3983189Z triton_mm_42 0.0075 ms 78.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3983317Z SingleProcess AUTOTUNE benchmarking takes 0.0396 seconds and 0.0564 seconds precompiling for 6 choices 2025-12-04T09:54:17.3983359Z Autotune Choices Stats: 2025-12-04T09:54:17.3983726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_53", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, 
num_warps=1", "best_time": 0.006719999946653843, "best_triton_pos": 0} 2025-12-04T09:54:17.3983764Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3983814Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3983860Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3984096Z triton_mm_53 0.0067 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3984340Z triton_mm_51 0.0068 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3984572Z triton_mm_52 0.0068 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3984801Z triton_mm_50 0.0119 ms 56.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3985045Z triton_mm_49 0.0134 ms 50.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3985276Z triton_mm_48 0.0163 ms 41.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3985402Z SingleProcess AUTOTUNE benchmarking takes 0.0536 seconds and 0.0777 seconds precompiling for 6 choices 2025-12-04T09:54:17.3985478Z ----------------------------- Captured stdout 
call ----------------------------- 2025-12-04T09:54:17.3985520Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3985579Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.3985683Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3986208Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3986263Z graph_break [] 2025-12-04T09:54:17.3986307Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.3986382Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3986424Z Autotune Choices Stats: 2025-12-04T09:54:17.3986793Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.3986835Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3986872Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3986919Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3987165Z triton_mm_2016 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3987402Z triton_mm_2017 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3987637Z triton_mm_2021 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3987900Z triton_mm_2018 0.0062 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3988137Z triton_mm_2019 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3988370Z triton_mm_2020 0.0076 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3988496Z SingleProcess AUTOTUNE benchmarking takes 0.0426 seconds and 0.0778 seconds precompiling for 6 choices 2025-12-04T09:54:17.3988537Z Autotune Choices Stats: 2025-12-04T09:54:17.3988919Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2022", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.3988962Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3988999Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3989047Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3989281Z triton_mm_2022 0.0057 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3989519Z triton_mm_2026 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3989755Z triton_mm_2024 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3990005Z triton_mm_2027 0.0060 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3990239Z triton_mm_2025 0.0061 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3990474Z triton_mm_2023 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3990600Z SingleProcess AUTOTUNE benchmarking takes 0.0428 seconds and 0.0751 seconds precompiling for 6 choices 2025-12-04T09:54:17.3990642Z Autotune Choices Stats: 2025-12-04T09:54:17.3991008Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2031", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", 
"best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.3991045Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3991084Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3991128Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3991368Z triton_mm_2031 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3991622Z triton_mm_2032 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3991858Z triton_mm_2033 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3992096Z triton_mm_2030 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3992339Z triton_mm_2029 0.0129 ms 46.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3992573Z triton_mm_2028 0.0165 ms 35.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3992698Z SingleProcess AUTOTUNE benchmarking takes 0.0557 seconds and 0.0847 seconds precompiling for 6 choices 2025-12-04T09:54:17.3992774Z ----------------------------- Captured stdout 
call ----------------------------- 2025-12-04T09:54:17.3992816Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.3992875Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.3992976Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.3993471Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.3993522Z graph_break [] 2025-12-04T09:54:17.3993567Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.3993641Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.3993683Z Autotune Choices Stats: 2025-12-04T09:54:17.3994052Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2036", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.3994091Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.3994131Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.3994176Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.3994415Z triton_mm_2036 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3994648Z triton_mm_2034 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3994883Z triton_mm_2037 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3995126Z triton_mm_2035 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3995376Z triton_mm_2039 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3995612Z triton_mm_2038 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3995739Z SingleProcess AUTOTUNE benchmarking takes 0.0410 seconds and 0.0600 seconds precompiling for 6 choices 2025-12-04T09:54:17.3995780Z Autotune Choices Stats: 2025-12-04T09:54:17.3996187Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2043", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.3996227Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.3996266Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.3996318Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.3996553Z triton_mm_2043 0.0058 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3996786Z triton_mm_2042 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3997019Z triton_mm_2045 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3997253Z triton_mm_2044 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3997500Z triton_mm_2041 0.0076 ms 76.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3997731Z triton_mm_2040 0.0156 ms 37.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3997860Z SingleProcess AUTOTUNE benchmarking takes 0.0513 seconds and 0.0748 seconds precompiling for 6 choices 2025-12-04T09:54:17.3997901Z Autotune Choices Stats: 2025-12-04T09:54:17.3998271Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2049", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", 
"best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.3998309Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.3998348Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.3998393Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.3998631Z triton_mm_2049 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3998879Z triton_mm_2048 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3999130Z triton_mm_2051 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3999366Z triton_mm_2047 0.0062 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3999599Z triton_mm_2050 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.3999846Z triton_mm_2046 0.0069 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.3999973Z SingleProcess AUTOTUNE benchmarking takes 0.0424 seconds and 0.0784 seconds precompiling for 6 choices 2025-12-04T09:54:17.4000052Z ----------------------------- Captured stdout 
call ----------------------------- 2025-12-04T09:54:17.4000095Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4000152Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4000251Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4000750Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4000790Z graph_break [] 2025-12-04T09:54:17.4000834Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4000928Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4000970Z Autotune Choices Stats: 2025-12-04T09:54:17.4001336Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2054", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4001374Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4001412Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4001456Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4001698Z triton_mm_2054 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4001932Z triton_mm_2052 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4002172Z triton_mm_2057 0.0059 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4002407Z triton_mm_2055 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4002649Z triton_mm_2053 0.0062 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4002894Z triton_mm_2056 0.0062 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4003023Z SingleProcess AUTOTUNE benchmarking takes 0.0440 seconds and 0.0592 seconds precompiling for 6 choices 2025-12-04T09:54:17.4003064Z Autotune Choices Stats: 2025-12-04T09:54:17.4003428Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2062", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4003479Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4003516Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4003565Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4003803Z triton_mm_2062 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4004040Z triton_mm_2059 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4004274Z triton_mm_2063 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4004513Z triton_mm_2058 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4004761Z triton_mm_2060 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4004993Z triton_mm_2061 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4005120Z SingleProcess AUTOTUNE benchmarking takes 0.0440 seconds and 0.0709 seconds precompiling for 6 choices 2025-12-04T09:54:17.4005159Z Autotune Choices Stats: 2025-12-04T09:54:17.4005531Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2067", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 
0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4005573Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4005610Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4005657Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4005894Z triton_mm_2067 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4006164Z triton_mm_2068 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4006410Z triton_mm_2066 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4006665Z triton_mm_2064 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4006898Z triton_mm_2069 0.0066 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4007130Z triton_mm_2065 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4007274Z SingleProcess AUTOTUNE benchmarking takes 0.0418 seconds and 0.0729 seconds precompiling for 6 choices 2025-12-04T09:54:17.4007349Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4007395Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4007453Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4007555Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4008048Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4008085Z graph_break [] 2025-12-04T09:54:17.4008129Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4008205Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4008244Z Autotune Choices Stats: 2025-12-04T09:54:17.4008617Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2073", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005878999829292297, "best_triton_pos": 0} 2025-12-04T09:54:17.4008672Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4008710Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4008754Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4008994Z triton_mm_2073 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4009228Z triton_mm_2075 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4009465Z triton_mm_2074 0.0069 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4009697Z triton_mm_2072 0.0070 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4009928Z triton_mm_2070 0.0116 ms 50.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4010179Z triton_mm_2071 0.0138 ms 42.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4010315Z SingleProcess AUTOTUNE benchmarking takes 0.0508 seconds and 0.0746 seconds precompiling for 6 choices 2025-12-04T09:54:17.4010358Z Autotune Choices Stats: 2025-12-04T09:54:17.4010726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2079", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4010767Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4010808Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4010855Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4011100Z triton_mm_2079 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4011333Z triton_mm_2078 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4011565Z triton_mm_2080 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4011799Z triton_mm_2077 0.0074 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4012034Z triton_mm_2081 0.0075 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4012280Z triton_mm_2076 0.0090 ms 65.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4012405Z SingleProcess AUTOTUNE benchmarking takes 0.0413 seconds and 0.0844 seconds precompiling for 6 choices 2025-12-04T09:54:17.4012445Z Autotune Choices Stats: 2025-12-04T09:54:17.4012822Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2085", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 
0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.4012863Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4012900Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4012945Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4013181Z triton_mm_2085 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4013414Z triton_mm_2087 0.0061 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4013647Z triton_mm_2086 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4013890Z triton_mm_2084 0.0082 ms 73.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4014133Z triton_mm_2082 0.0117 ms 51.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4014366Z triton_mm_2083 0.0132 ms 45.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4014493Z SingleProcess AUTOTUNE benchmarking takes 0.0556 seconds and 0.0852 seconds precompiling for 6 choices 2025-12-04T09:54:17.4014578Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4014622Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4014678Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4014780Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4015276Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4015314Z graph_break [] 2025-12-04T09:54:17.4015357Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4015430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4015471Z Autotune Choices Stats: 2025-12-04T09:54:17.4015840Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2091", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006560000125318766, "best_triton_pos": 0} 2025-12-04T09:54:17.4015891Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4015950Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4015996Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4016233Z triton_mm_2091 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4016468Z triton_mm_2088 0.0066 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4016703Z triton_mm_2089 0.0067 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4016939Z triton_mm_2090 0.0070 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4017171Z triton_mm_2092 0.0076 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4017405Z triton_mm_2093 0.0083 ms 78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4017544Z SingleProcess AUTOTUNE benchmarking takes 0.0456 seconds and 0.0708 seconds precompiling for 6 choices 2025-12-04T09:54:17.4017584Z Autotune Choices Stats: 2025-12-04T09:54:17.4017965Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2099", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4018003Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4018041Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4018087Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4018336Z triton_mm_2099 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4018571Z triton_mm_2094 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4018809Z triton_mm_2097 0.0061 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4019042Z triton_mm_2095 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4019277Z triton_mm_2098 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4019511Z triton_mm_2096 0.0066 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4019652Z SingleProcess AUTOTUNE benchmarking takes 0.0441 seconds and 0.0800 seconds precompiling for 6 choices 2025-12-04T09:54:17.4019693Z Autotune Choices Stats: 2025-12-04T09:54:17.4020057Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2104", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 
0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4020097Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4020133Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4020180Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4020415Z triton_mm_2104 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4020649Z triton_mm_2100 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4020887Z triton_mm_2105 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4021121Z triton_mm_2101 0.0063 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4021376Z triton_mm_2102 0.0065 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4021609Z triton_mm_2103 0.0086 ms 67.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4021735Z SingleProcess AUTOTUNE benchmarking takes 0.0426 seconds and 0.0853 seconds precompiling for 6 choices 2025-12-04T09:54:17.4021808Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4021851Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4021908Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4022022Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4022514Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4022553Z graph_break [] 2025-12-04T09:54:17.4022598Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4022671Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4022715Z Autotune Choices Stats: 2025-12-04T09:54:17.4023086Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2109", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4023145Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4023183Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4023228Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4023462Z triton_mm_2109 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4023697Z triton_mm_2111 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4023930Z triton_mm_2106 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4024163Z triton_mm_2107 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4024397Z triton_mm_2108 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4024629Z triton_mm_2110 0.0077 ms 75.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4024767Z SingleProcess AUTOTUNE benchmarking takes 0.0411 seconds and 0.0732 seconds precompiling for 6 choices 2025-12-04T09:54:17.4024807Z Autotune Choices Stats: 2025-12-04T09:54:17.4025191Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2112", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4025230Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4025268Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4025314Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4025551Z triton_mm_2112 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4025796Z triton_mm_2113 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4026050Z triton_mm_2115 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4026285Z triton_mm_2114 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4026522Z triton_mm_2117 0.0068 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4026757Z triton_mm_2116 0.0084 ms 69.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4026883Z SingleProcess AUTOTUNE benchmarking takes 0.0496 seconds and 0.0740 seconds precompiling for 6 choices 2025-12-04T09:54:17.4026940Z Autotune Choices Stats: 2025-12-04T09:54:17.4027307Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2120", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 
0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4027347Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4027384Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4027429Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4027666Z triton_mm_2120 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4027899Z triton_mm_2122 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4028132Z triton_mm_2123 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4028362Z triton_mm_2118 0.0063 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4028609Z triton_mm_2121 0.0067 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4028853Z triton_mm_2119 0.0071 ms 80.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4028982Z SingleProcess AUTOTUNE benchmarking takes 0.0426 seconds and 0.0683 seconds precompiling for 6 choices 2025-12-04T09:54:17.4029060Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4029101Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4029158Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4029259Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4029766Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4029803Z graph_break [] 2025-12-04T09:54:17.4029847Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4029920Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4029960Z Autotune Choices Stats: 2025-12-04T09:54:17.4030327Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2127", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679000169038773, "best_triton_pos": 0} 2025-12-04T09:54:17.4030368Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4030405Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4030451Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4030688Z triton_mm_2127 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4030933Z triton_mm_2128 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4031169Z triton_mm_2125 0.0058 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4031405Z triton_mm_2129 0.0059 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4031641Z triton_mm_2124 0.0062 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4031874Z triton_mm_2126 0.0076 ms 74.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4032001Z SingleProcess AUTOTUNE benchmarking takes 0.0423 seconds and 0.0701 seconds precompiling for 6 choices 2025-12-04T09:54:17.4032040Z Autotune Choices Stats: 2025-12-04T09:54:17.4032422Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2132", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4032473Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4032512Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4032559Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4032794Z triton_mm_2132 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4033031Z triton_mm_2133 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4033278Z triton_mm_2134 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4033513Z triton_mm_2135 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4033746Z triton_mm_2130 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4033980Z triton_mm_2131 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4034109Z SingleProcess AUTOTUNE benchmarking takes 0.0377 seconds and 0.0771 seconds precompiling for 6 choices 2025-12-04T09:54:17.4034149Z Autotune Choices Stats: 2025-12-04T09:54:17.4034517Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2141", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 
0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4034568Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4034606Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4034651Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4034889Z triton_mm_2141 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4035125Z triton_mm_2139 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4035363Z triton_mm_2140 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4035595Z triton_mm_2138 0.0074 ms 77.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4035828Z triton_mm_2136 0.0114 ms 50.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4036093Z triton_mm_2137 0.0147 ms 39.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4036236Z SingleProcess AUTOTUNE benchmarking takes 0.0537 seconds and 0.0721 seconds precompiling for 6 choices 2025-12-04T09:54:17.4036313Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4036354Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4036411Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4036511Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4037024Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4037062Z graph_break [] 2025-12-04T09:54:17.4037109Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4037183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4037226Z Autotune Choices Stats: 2025-12-04T09:54:17.4037592Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2145", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4037632Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4037671Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4037716Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4037955Z triton_mm_2145 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4038200Z triton_mm_2143 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4038433Z triton_mm_2144 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4038665Z triton_mm_2142 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4038897Z triton_mm_2147 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4039131Z triton_mm_2146 0.0064 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4039258Z SingleProcess AUTOTUNE benchmarking takes 0.0385 seconds and 0.0718 seconds precompiling for 6 choices 2025-12-04T09:54:17.4039299Z Autotune Choices Stats: 2025-12-04T09:54:17.4039670Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2153", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00595899997279048, "best_triton_pos": 0} 2025-12-04T09:54:17.4039721Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4039758Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4039806Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4040053Z triton_mm_2153 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4040285Z triton_mm_2148 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4040536Z triton_mm_2152 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4040772Z triton_mm_2151 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4041007Z triton_mm_2150 0.0064 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4041242Z triton_mm_2149 0.0070 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4041369Z SingleProcess AUTOTUNE benchmarking takes 0.0388 seconds and 0.0753 seconds precompiling for 6 choices 2025-12-04T09:54:17.4041410Z Autotune Choices Stats: 2025-12-04T09:54:17.4041783Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2159", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 
0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.4041833Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4041870Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4041914Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4042150Z triton_mm_2159 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4042383Z triton_mm_2154 0.0066 ms 91.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4042616Z triton_mm_2157 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4042851Z triton_mm_2158 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4043082Z triton_mm_2156 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4043317Z triton_mm_2155 0.0073 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4043454Z SingleProcess AUTOTUNE benchmarking takes 0.0433 seconds and 0.0736 seconds precompiling for 6 choices 2025-12-04T09:54:17.4043541Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4043582Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4043639Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4043739Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4044233Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4044281Z graph_break [] 2025-12-04T09:54:17.4044324Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4044399Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4044439Z Autotune Choices Stats: 2025-12-04T09:54:17.4044807Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2162", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4044845Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4044882Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4044926Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4045166Z triton_mm_2162 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4045402Z triton_mm_2165 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4045650Z triton_mm_2161 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4045882Z triton_mm_2160 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4046141Z triton_mm_2163 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4046376Z triton_mm_2164 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4046503Z SingleProcess AUTOTUNE benchmarking takes 0.0414 seconds and 0.0730 seconds precompiling for 6 choices 2025-12-04T09:54:17.4046543Z Autotune Choices Stats: 2025-12-04T09:54:17.4046914Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2168", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.4046968Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4047006Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4047054Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4047304Z triton_mm_2168 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4047543Z triton_mm_2171 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4047776Z triton_mm_2170 0.0068 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4048022Z triton_mm_2169 0.0068 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4048257Z triton_mm_2167 0.0165 ms 35.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4048489Z triton_mm_2166 0.0168 ms 34.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4048615Z SingleProcess AUTOTUNE benchmarking takes 0.0552 seconds and 0.0874 seconds precompiling for 6 choices 2025-12-04T09:54:17.4048655Z Autotune Choices Stats: 2025-12-04T09:54:17.4049021Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2177", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 
0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.4049060Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4049112Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4049157Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4049394Z triton_mm_2177 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4049632Z triton_mm_2176 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4049867Z triton_mm_2172 0.0063 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4050101Z triton_mm_2174 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4050334Z triton_mm_2175 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4050567Z triton_mm_2173 0.0071 ms 82.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4050703Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0780 seconds precompiling for 6 choices 2025-12-04T09:54:17.4050778Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4050820Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4050877Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4050997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4051490Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4051527Z graph_break [] 2025-12-04T09:54:17.4051570Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4051655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4051695Z Autotune Choices Stats: 2025-12-04T09:54:17.4052065Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2180", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4052103Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4052141Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4052186Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4052424Z triton_mm_2180 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4052658Z triton_mm_2179 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4052894Z triton_mm_2183 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4053138Z triton_mm_2181 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4053369Z triton_mm_2182 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4053607Z triton_mm_2178 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4053734Z SingleProcess AUTOTUNE benchmarking takes 0.0431 seconds and 0.0667 seconds precompiling for 6 choices 2025-12-04T09:54:17.4053775Z Autotune Choices Stats: 2025-12-04T09:54:17.4054141Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2187", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4054181Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4054218Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4054265Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4054513Z triton_mm_2187 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4054758Z triton_mm_2184 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4054995Z triton_mm_2189 0.0064 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4055227Z triton_mm_2188 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4055473Z triton_mm_2185 0.0068 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4055709Z triton_mm_2186 0.0068 ms 85.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4055838Z SingleProcess AUTOTUNE benchmarking takes 0.0429 seconds and 0.0745 seconds precompiling for 6 choices 2025-12-04T09:54:17.4055877Z Autotune Choices Stats: 2025-12-04T09:54:17.4056280Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2190", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 
0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4056321Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4056358Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4056402Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4056638Z triton_mm_2190 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4056889Z triton_mm_2192 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4057121Z triton_mm_2193 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4057356Z triton_mm_2195 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4057590Z triton_mm_2194 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4057825Z triton_mm_2191 0.0073 ms 81.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4057951Z SingleProcess AUTOTUNE benchmarking takes 0.0433 seconds and 0.0805 seconds precompiling for 6 choices 2025-12-04T09:54:17.4058024Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:54:17.4058080Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4058137Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4058239Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4058845Z inductor [('triton_bundler_save_kernel', 168), ('benchmarking.InductorBenchmarker.benchmark_gpu', 23), ('async_compile_cache_miss', 15), ('benchmarking.InductorBenchmarker.benchmark', 15), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('async_compile_cache_hit', 3), ('pad_mm_bench', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4058885Z graph_break [] 2025-12-04T09:54:17.4058927Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.4059003Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4059043Z Autotune Choices Stats: 2025-12-04T09:54:17.4059423Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2197", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4059463Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4059500Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4059544Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4059781Z triton_mm_2197 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4060018Z triton_mm_2196 0.0061 ms 95.4% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4060252Z triton_mm_2200 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4060498Z triton_mm_2198 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4060729Z triton_mm_2199 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4060962Z triton_mm_2201 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4061087Z SingleProcess AUTOTUNE benchmarking takes 0.0375 seconds and 0.0600 seconds precompiling for 6 choices 2025-12-04T09:54:17.4061129Z Autotune Choices Stats: 2025-12-04T09:54:17.4061496Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2205", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4061536Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4061574Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4061620Z dtypes: torch.bfloat16, torch.bfloat16 
2025-12-04T09:54:17.4061859Z triton_mm_2205 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4062112Z triton_mm_2206 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4062348Z triton_mm_2202 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4062578Z triton_mm_2204 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4062824Z triton_mm_2207 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4063059Z triton_mm_2203 0.0063 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4063186Z SingleProcess AUTOTUNE benchmarking takes 0.0388 seconds and 0.0770 seconds precompiling for 6 choices 2025-12-04T09:54:17.4063271Z __________________ TestPatternMatcher.test_mixed_mm_epi_works __________________ 2025-12-04T09:54:17.4063316Z Traceback (most recent call last): 2025-12-04T09:54:17.4063463Z File "/var/lib/jenkins/pytorch/test/inductor/test_pattern_matcher.py", line 463, in 
test_mixed_mm_epi_works 2025-12-04T09:54:17.4063518Z self._test_mixed_impl(fn, args, True, False) 2025-12-04T09:54:17.4063646Z File "/var/lib/jenkins/pytorch/test/inductor/test_pattern_matcher.py", line 325, in _test_mixed_impl 2025-12-04T09:54:17.4063724Z FileCheck().check("k_idx").check(".to(").check("tl.dot").run(code) 2025-12-04T09:54:17.4063796Z RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:54:17.4063833Z Searched string: 2025-12-04T09:54:17.4063908Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:54:17.4063910Z 2025-12-04T09:54:17.4063961Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:54:17.4063964Z 2025-12-04T09:54:17.4064024Z a_mask = offs_k[None, :] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.4064079Z b_mask = offs_k[:, None] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.4064081Z 2025-12-04T09:54:17.4064142Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.4064198Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.4066376Z 2025-12-04T09:54:17.4066427Z idx_m = offs_a_m[:, None] 2025-12-04T09:54:17.4066474Z idx_n = a_k_idx_vals 2025-12-04T09:54:17.4066515Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4066581Z a = tl.load(A + (xindex), mask=a_mask, other=0.0) 2025-12-04T09:54:17.4066583Z 2025-12-04T09:54:17.4066622Z idx_m = b_k_idx_vals 2025-12-04T09:54:17.4066663Z idx_n = offs_b_n[None, :] 2025-12-04T09:54:17.4066707Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4066762Z b = tl.load(B + (xindex), mask=b_mask, other=0.0) 2025-12-04T09:54:17.4066764Z 2025-12-04T09:54:17.4066766Z 2025-12-04T09:54:17.4066836Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:54:17.4066839Z 2025-12-04T09:54:17.4066840Z 2025-12-04T09:54:17.4066895Z # rematerialize rm and rn to save registers 2025-12-04T09:54:17.4066946Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:54:17.4066997Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 
2025-12-04T09:54:17.4067036Z idx_m = rm[:, None] 2025-12-04T09:54:17.4067119Z idx_n = rn[None, :] 2025-12-04T09:54:17.4067161Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:54:17.4067164Z 2025-12-04T09:54:17.4067207Z # inductor generates a suffix 2025-12-04T09:54:17.4067246Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4067366Z tmp0 = tl.load(in_ptr2 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last') 2025-12-04T09:54:17.4067489Z tmp2 = tl.load(in_ptr3 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last') 2025-12-04T09:54:17.4067530Z tmp1 = acc * tmp0 2025-12-04T09:54:17.4067568Z tmp3 = tmp1 + tmp2 2025-12-04T09:54:17.4067667Z tl.store(out_ptr1 + (tl.broadcast_to(idx_n + 8*idx_m, [BLOCK_M, BLOCK_N])), tmp3, mask) 2025-12-04T09:54:17.4067705Z ''', device_str='cuda') 2025-12-04T09:54:17.4067707Z 2025-12-04T09:54:17.4067709Z 2025-12-04T09:54:17.4067753Z async_compile.wait(globals()) 2025-12-04T09:54:17.4067791Z del async_compile 2025-12-04T09:54:17.4067793Z 2025-12-04T09:54:17.4067831Z class Runner: 2025-12-04T09:54:17.4067877Z def __init__(self, partitions): 2025-12-04T09:54:17.4067942Z self.partitions = partitions 2025-12-04T09:54:17.4067944Z 2025-12-04T09:54:17.4067993Z def recursively_apply_fns(self, fns): 2025-12-04T09:54:17.4068035Z new_callables = [] 2025-12-04T09:54:17.4068089Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:54:17.4068139Z new_callables.append(fn(c)) 2025-12-04T09:54:17.4068185Z self.partitions = new_callables 2025-12-04T09:54:17.4068187Z 2025-12-04T09:54:17.4068231Z def call(self, args): 2025-12-04T09:54:17.4068277Z arg0_1, arg1_1, arg2_1, arg3_1 = args 2025-12-04T09:54:17.4068314Z args.clear() 2025-12-04T09:54:17.4068363Z assert_size_stride(arg0_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.4068412Z assert_size_stride(arg1_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.4068457Z assert_size_stride(arg2_1, (8, ), (1, )) 2025-12-04T09:54:17.4068504Z assert_size_stride(arg3_1, (8, ), (1, )) 
2025-12-04T09:54:17.4068552Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:54:17.4068596Z torch.cuda.set_device(0) 2025-12-04T09:54:17.4068664Z buf0 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.4068754Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:54:17.4068814Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4068893Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 64, stream=stream0) 2025-12-04T09:54:17.4068928Z del arg0_1 2025-12-04T09:54:17.4068993Z buf2 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.4069128Z # Topologically Sorted Source Nodes: [to, mm, mul, add], Original ATen: [aten._to_copy, aten.mm, aten.mul, aten.add] 2025-12-04T09:54:17.4069172Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4069287Z triton_tem_fused__to_copy_add_mm_mul_1.run(arg1_1, buf0, arg2_1, arg3_1, buf2, 1, 1, 1, stream=stream0) 2025-12-04T09:54:17.4069326Z del arg1_1 2025-12-04T09:54:17.4069364Z del arg2_1 2025-12-04T09:54:17.4069402Z del arg3_1 2025-12-04T09:54:17.4069436Z del buf0 2025-12-04T09:54:17.4069474Z return (buf2, ) 2025-12-04T09:54:17.4069479Z 2025-12-04T09:54:17.4069525Z runner = Runner(partitions=[]) 2025-12-04T09:54:17.4069563Z call = runner.call 2025-12-04T09:54:17.4069630Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:54:17.4069633Z 2025-12-04T09:54:17.4069634Z 2025-12-04T09:54:17.4069696Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:54:17.4069750Z from torch._dynamo.testing import rand_strided 2025-12-04T09:54:17.4069817Z from torch._inductor.utils import print_performance 2025-12-04T09:54:17.4069895Z arg0_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:54:17.4069971Z arg1_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.4070056Z arg2_1 = rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.4070131Z arg3_1 = 
rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.4070187Z fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1]) 2025-12-04T09:54:17.4070271Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:54:17.4070273Z 2025-12-04T09:54:17.4070275Z 2025-12-04T09:54:17.4070317Z if __name__ == "__main__": 2025-12-04T09:54:17.4070399Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:54:17.4070467Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:54:17.4070503Z From CHECK: .to( 2025-12-04T09:54:17.4070505Z 2025-12-04T09:54:17.4070507Z 2025-12-04T09:54:17.4070581Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:17.4070726Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_epi_works 2025-12-04T09:54:17.4070729Z 2025-12-04T09:54:17.4070829Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:17.4070905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4070948Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4071007Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4071109Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4071608Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4071645Z graph_break [] 2025-12-04T09:54:17.4071690Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4071766Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4071807Z Autotune Choices Stats: 2025-12-04T09:54:17.4072185Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_36", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4072240Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4072277Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4072323Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4072565Z triton_mm_36 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4072800Z triton_mm_40 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4073033Z triton_mm_38 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4073264Z triton_mm_39 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4073495Z triton_mm_41 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4073738Z triton_mm_37 0.0069 ms 84.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4073880Z SingleProcess AUTOTUNE benchmarking takes 0.5618 seconds and 0.0833 seconds precompiling for 6 choices 2025-12-04T09:54:17.4073920Z Autotune Choices Stats: 2025-12-04T09:54:17.4074291Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_47", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.4074330Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4074367Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4074416Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4074663Z triton_mm_47 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4074895Z triton_mm_46 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4075125Z triton_mm_43 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4075355Z triton_mm_44 0.0067 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4075588Z triton_mm_45 0.0068 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4075833Z triton_mm_42 0.0075 ms 78.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4075990Z SingleProcess AUTOTUNE benchmarking takes 0.0396 seconds and 0.0564 seconds precompiling for 6 choices 2025-12-04T09:54:17.4076032Z Autotune Choices Stats: 2025-12-04T09:54:17.4076402Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_53", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006719999946653843, "best_triton_pos": 0} 2025-12-04T09:54:17.4076441Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4076478Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4076525Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4076762Z triton_mm_53 0.0067 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4076994Z triton_mm_51 0.0068 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4077224Z triton_mm_52 0.0068 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4077472Z triton_mm_50 0.0119 ms 56.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4077716Z triton_mm_49 0.0134 ms 50.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4077946Z triton_mm_48 0.0163 ms 41.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4078073Z SingleProcess AUTOTUNE benchmarking takes 0.0536 seconds and 0.0777 seconds precompiling for 6 choices 2025-12-04T09:54:17.4078147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4078206Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4078264Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4078366Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4078864Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4078902Z graph_break [] 2025-12-04T09:54:17.4078945Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4079020Z ----------------------------- Captured 
stderr call ----------------------------- 2025-12-04T09:54:17.4079061Z Autotune Choices Stats: 2025-12-04T09:54:17.4079432Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2016", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4079492Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4079530Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4079576Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4079813Z triton_mm_2016 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4080047Z triton_mm_2017 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4080284Z triton_mm_2021 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4080523Z triton_mm_2018 0.0062 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4080754Z triton_mm_2019 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4080986Z triton_mm_2020 
0.0076 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4081122Z SingleProcess AUTOTUNE benchmarking takes 0.0426 seconds and 0.0778 seconds precompiling for 6 choices 2025-12-04T09:54:17.4081162Z Autotune Choices Stats: 2025-12-04T09:54:17.4081540Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2022", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4081578Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4081616Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4081662Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4081901Z triton_mm_2022 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4082147Z triton_mm_2026 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4082381Z triton_mm_2024 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4082619Z triton_mm_2027 0.0060 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.4082852Z triton_mm_2025 0.0061 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4083086Z triton_mm_2023 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4083223Z SingleProcess AUTOTUNE benchmarking takes 0.0428 seconds and 0.0751 seconds precompiling for 6 choices 2025-12-04T09:54:17.4083265Z Autotune Choices Stats: 2025-12-04T09:54:17.4083630Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2031", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4083669Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4083707Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4083752Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4083988Z triton_mm_2031 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4084223Z triton_mm_2032 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4084459Z triton_mm_2033 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4084694Z triton_mm_2030 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4084935Z triton_mm_2029 0.0129 ms 46.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4085179Z triton_mm_2028 0.0165 ms 35.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4085304Z SingleProcess AUTOTUNE benchmarking takes 0.0557 seconds and 0.0847 seconds precompiling for 6 choices 2025-12-04T09:54:17.4085378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4085421Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4085478Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4085591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4086150Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4086187Z graph_break [] 2025-12-04T09:54:17.4086231Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4086305Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4086346Z Autotune Choices Stats: 2025-12-04T09:54:17.4086720Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2036", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4086760Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4086797Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4086860Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4087098Z triton_mm_2036 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4087333Z triton_mm_2034 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4087565Z triton_mm_2037 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4087802Z triton_mm_2035 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4088037Z triton_mm_2039 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4088269Z triton_mm_2038 0.0066 ms 
87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4088394Z SingleProcess AUTOTUNE benchmarking takes 0.0410 seconds and 0.0600 seconds precompiling for 6 choices 2025-12-04T09:54:17.4088451Z Autotune Choices Stats: 2025-12-04T09:54:17.4088838Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2043", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4088879Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4088916Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4088962Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4089199Z triton_mm_2043 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4089452Z triton_mm_2042 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4089687Z triton_mm_2045 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4089919Z triton_mm_2044 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4090151Z triton_mm_2041 0.0076 ms 76.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4090385Z triton_mm_2040 0.0156 ms 37.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4090510Z SingleProcess AUTOTUNE benchmarking takes 0.0513 seconds and 0.0748 seconds precompiling for 6 choices 2025-12-04T09:54:17.4090564Z Autotune Choices Stats: 2025-12-04T09:54:17.4090939Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2049", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4090977Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4091015Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4091059Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4091296Z triton_mm_2049 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4091530Z triton_mm_2048 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4091764Z triton_mm_2051 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4091995Z triton_mm_2047 0.0062 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4092233Z triton_mm_2050 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4092486Z triton_mm_2046 0.0069 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4092613Z SingleProcess AUTOTUNE benchmarking takes 0.0424 seconds and 0.0784 seconds precompiling for 6 choices 2025-12-04T09:54:17.4092688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4092730Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4092787Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4092888Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4093406Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4093443Z graph_break [] 2025-12-04T09:54:17.4093487Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4093560Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4093601Z Autotune Choices Stats: 2025-12-04T09:54:17.4093967Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2054", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4094007Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4094045Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4094091Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4094330Z triton_mm_2054 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4094577Z triton_mm_2052 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4094813Z triton_mm_2057 0.0059 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4095049Z triton_mm_2055 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4095283Z triton_mm_2053 0.0062 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4095515Z triton_mm_2056 0.0062 ms 
92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4095641Z SingleProcess AUTOTUNE benchmarking takes 0.0440 seconds and 0.0592 seconds precompiling for 6 choices 2025-12-04T09:54:17.4095680Z Autotune Choices Stats: 2025-12-04T09:54:17.4096071Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2062", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4096127Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4096175Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4096223Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4096459Z triton_mm_2062 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4096691Z triton_mm_2059 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4096939Z triton_mm_2063 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4097176Z triton_mm_2058 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4097410Z triton_mm_2060 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4097644Z triton_mm_2061 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4097772Z SingleProcess AUTOTUNE benchmarking takes 0.0440 seconds and 0.0709 seconds precompiling for 6 choices 2025-12-04T09:54:17.4097811Z Autotune Choices Stats: 2025-12-04T09:54:17.4098179Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2067", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4098229Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4098266Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4098310Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4098546Z triton_mm_2067 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4098780Z triton_mm_2068 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4099016Z triton_mm_2066 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4099249Z triton_mm_2064 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4099481Z triton_mm_2069 0.0066 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4099723Z triton_mm_2065 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4099858Z SingleProcess AUTOTUNE benchmarking takes 0.0418 seconds and 0.0729 seconds precompiling for 6 choices 2025-12-04T09:54:17.4099935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4099977Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4100034Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4100133Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4100636Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4100674Z graph_break [] 2025-12-04T09:54:17.4100718Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4100794Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4100834Z Autotune Choices Stats: 2025-12-04T09:54:17.4101206Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2073", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005878999829292297, "best_triton_pos": 0} 2025-12-04T09:54:17.4101244Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4101282Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4101326Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4101564Z triton_mm_2073 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4101797Z triton_mm_2075 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4102041Z triton_mm_2074 0.0069 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4102273Z triton_mm_2072 0.0070 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4102508Z triton_mm_2070 0.0116 ms 50.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4102742Z triton_mm_2071 0.0138 
ms 42.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4102866Z SingleProcess AUTOTUNE benchmarking takes 0.0508 seconds and 0.0746 seconds precompiling for 6 choices 2025-12-04T09:54:17.4102907Z Autotune Choices Stats: 2025-12-04T09:54:17.4103277Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2079", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4103329Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4103365Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4103412Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4103669Z triton_mm_2079 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4103903Z triton_mm_2078 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4104135Z triton_mm_2080 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4104378Z triton_mm_2077 0.0074 ms 79.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4104613Z triton_mm_2081 0.0075 ms 79.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4104844Z triton_mm_2076 0.0090 ms 65.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4104969Z SingleProcess AUTOTUNE benchmarking takes 0.0413 seconds and 0.0844 seconds precompiling for 6 choices 2025-12-04T09:54:17.4105010Z Autotune Choices Stats: 2025-12-04T09:54:17.4105380Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2085", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.4105431Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4105468Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4105512Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4105749Z triton_mm_2085 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4106016Z triton_mm_2087 0.0061 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4106251Z triton_mm_2086 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4106486Z triton_mm_2084 0.0082 ms 73.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4106716Z triton_mm_2082 0.0117 ms 51.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4106950Z triton_mm_2083 0.0132 ms 45.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4107095Z SingleProcess AUTOTUNE benchmarking takes 0.0556 seconds and 0.0852 seconds precompiling for 6 choices 2025-12-04T09:54:17.4107171Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4107233Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4107292Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4107392Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4107886Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4107924Z graph_break [] 2025-12-04T09:54:17.4107982Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4108058Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4108097Z Autotune Choices Stats: 2025-12-04T09:54:17.4108466Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2091", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006560000125318766, "best_triton_pos": 0} 2025-12-04T09:54:17.4108504Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4108541Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4108585Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4108824Z triton_mm_2091 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4109058Z triton_mm_2088 0.0066 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4109307Z triton_mm_2089 0.0067 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4109540Z triton_mm_2090 0.0070 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4109774Z triton_mm_2092 0.0076 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4110007Z triton_mm_2093 0.0083 ms 
78.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4110134Z SingleProcess AUTOTUNE benchmarking takes 0.0456 seconds and 0.0708 seconds precompiling for 6 choices 2025-12-04T09:54:17.4110175Z Autotune Choices Stats: 2025-12-04T09:54:17.4110541Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2099", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4110580Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4110628Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4110676Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4110913Z triton_mm_2099 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4111160Z triton_mm_2094 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4111398Z triton_mm_2097 0.0061 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4111661Z triton_mm_2095 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4111897Z triton_mm_2098 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4112130Z triton_mm_2096 0.0066 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4112256Z SingleProcess AUTOTUNE benchmarking takes 0.0441 seconds and 0.0800 seconds precompiling for 6 choices 2025-12-04T09:54:17.4112295Z Autotune Choices Stats: 2025-12-04T09:54:17.4112661Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2104", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4112702Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4112739Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4112799Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4113036Z triton_mm_2104 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4113269Z triton_mm_2100 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4113508Z triton_mm_2105 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4113743Z triton_mm_2101 0.0063 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4113975Z triton_mm_2102 0.0065 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4114207Z triton_mm_2103 0.0086 ms 67.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4114332Z SingleProcess AUTOTUNE benchmarking takes 0.0426 seconds and 0.0853 seconds precompiling for 6 choices 2025-12-04T09:54:17.4114417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4114459Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4114515Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4114626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4115118Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4115155Z graph_break [] 2025-12-04T09:54:17.4115197Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4115273Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4115323Z Autotune Choices Stats: 2025-12-04T09:54:17.4115693Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2109", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4115731Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4115769Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4115813Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4116080Z triton_mm_2109 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4116319Z triton_mm_2111 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4116555Z triton_mm_2106 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4116803Z triton_mm_2107 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4117034Z triton_mm_2108 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4117269Z triton_mm_2110 0.0077 
ms 75.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4117397Z SingleProcess AUTOTUNE benchmarking takes 0.0411 seconds and 0.0732 seconds precompiling for 6 choices 2025-12-04T09:54:17.4117439Z Autotune Choices Stats: 2025-12-04T09:54:17.4117808Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2112", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4117847Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4117884Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4117930Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4118169Z triton_mm_2112 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4118431Z triton_mm_2113 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4118668Z triton_mm_2115 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4118900Z triton_mm_2114 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4119145Z triton_mm_2117 0.0068 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4119380Z triton_mm_2116 0.0084 ms 69.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4119506Z SingleProcess AUTOTUNE benchmarking takes 0.0496 seconds and 0.0740 seconds precompiling for 6 choices 2025-12-04T09:54:17.4119548Z Autotune Choices Stats: 2025-12-04T09:54:17.4119916Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2120", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4119956Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4119994Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4120038Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4120274Z triton_mm_2120 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4120518Z triton_mm_2122 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4120750Z triton_mm_2123 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4120984Z triton_mm_2118 0.0063 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4121218Z triton_mm_2121 0.0067 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4121454Z triton_mm_2119 0.0071 ms 80.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4121580Z SingleProcess AUTOTUNE benchmarking takes 0.0426 seconds and 0.0683 seconds precompiling for 6 choices 2025-12-04T09:54:17.4121654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4121714Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4121770Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4121872Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4122373Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4122412Z graph_break [] 2025-12-04T09:54:17.4122454Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4122528Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4122567Z Autotune Choices Stats: 2025-12-04T09:54:17.4122943Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2127", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679000169038773, "best_triton_pos": 0} 2025-12-04T09:54:17.4122985Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4123023Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4123068Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4123305Z triton_mm_2127 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4123539Z triton_mm_2128 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4123774Z triton_mm_2125 0.0058 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4124012Z triton_mm_2129 0.0059 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4124254Z triton_mm_2124 0.0062 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4124486Z triton_mm_2126 0.0076 
ms 74.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4124616Z SingleProcess AUTOTUNE benchmarking takes 0.0423 seconds and 0.0701 seconds precompiling for 6 choices 2025-12-04T09:54:17.4124655Z Autotune Choices Stats: 2025-12-04T09:54:17.4125021Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2132", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4125059Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4125096Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4125142Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4125378Z triton_mm_2132 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4125627Z triton_mm_2133 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4125872Z triton_mm_2134 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4126141Z triton_mm_2135 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4126374Z triton_mm_2130 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4126622Z triton_mm_2131 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4126749Z SingleProcess AUTOTUNE benchmarking takes 0.0377 seconds and 0.0771 seconds precompiling for 6 choices 2025-12-04T09:54:17.4126791Z Autotune Choices Stats: 2025-12-04T09:54:17.4127155Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2141", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4127194Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4127230Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4127275Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4127514Z triton_mm_2141 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4127751Z triton_mm_2139 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4127999Z triton_mm_2140 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4128231Z triton_mm_2138 0.0074 ms 77.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4128466Z triton_mm_2136 0.0114 ms 50.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4128699Z triton_mm_2137 0.0147 ms 39.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4128825Z SingleProcess AUTOTUNE benchmarking takes 0.0537 seconds and 0.0721 seconds precompiling for 6 choices 2025-12-04T09:54:17.4128900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4128942Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4128999Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4129101Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4129626Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4129663Z graph_break [] 2025-12-04T09:54:17.4129707Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4129781Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4129821Z Autotune Choices Stats: 2025-12-04T09:54:17.4130197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2145", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4130237Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4130274Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4130320Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4130557Z triton_mm_2145 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4130790Z triton_mm_2143 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4131020Z triton_mm_2144 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4131254Z triton_mm_2142 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4131497Z triton_mm_2147 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4131733Z triton_mm_2146 0.0064 ms 
90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4131860Z SingleProcess AUTOTUNE benchmarking takes 0.0385 seconds and 0.0718 seconds precompiling for 6 choices 2025-12-04T09:54:17.4131900Z Autotune Choices Stats: 2025-12-04T09:54:17.4132268Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2153", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00595899997279048, "best_triton_pos": 0} 2025-12-04T09:54:17.4132307Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4132345Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4132391Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4132628Z triton_mm_2153 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4132861Z triton_mm_2148 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4133107Z triton_mm_2152 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4133353Z triton_mm_2151 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4133584Z triton_mm_2150 0.0064 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4133821Z triton_mm_2149 0.0070 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4133958Z SingleProcess AUTOTUNE benchmarking takes 0.0388 seconds and 0.0753 seconds precompiling for 6 choices 2025-12-04T09:54:17.4133999Z Autotune Choices Stats: 2025-12-04T09:54:17.4134367Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2159", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.4134406Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4134444Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4134487Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4134724Z triton_mm_2159 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4134960Z triton_mm_2154 0.0066 ms 91.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4135209Z triton_mm_2157 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4135440Z triton_mm_2158 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4135674Z triton_mm_2156 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4135912Z triton_mm_2155 0.0073 ms 82.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4136067Z SingleProcess AUTOTUNE benchmarking takes 0.0433 seconds and 0.0736 seconds precompiling for 6 choices 2025-12-04T09:54:17.4136141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4136182Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4136239Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4136339Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4136834Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4136885Z graph_break [] 2025-12-04T09:54:17.4136942Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4137016Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4137056Z Autotune Choices Stats: 2025-12-04T09:54:17.4137422Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2162", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4137462Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4137498Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4137556Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4137794Z triton_mm_2162 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4138035Z triton_mm_2165 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4138271Z triton_mm_2161 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4138504Z triton_mm_2160 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4138739Z triton_mm_2163 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4138986Z triton_mm_2164 0.0067 
ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4139113Z SingleProcess AUTOTUNE benchmarking takes 0.0414 seconds and 0.0730 seconds precompiling for 6 choices 2025-12-04T09:54:17.4139152Z Autotune Choices Stats: 2025-12-04T09:54:17.4139519Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2168", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.4139558Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4139595Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4139644Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4139880Z triton_mm_2168 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4140121Z triton_mm_2171 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4140355Z triton_mm_2170 0.0068 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4140609Z triton_mm_2169 0.0068 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4140842Z triton_mm_2167 0.0165 ms 35.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4141075Z triton_mm_2166 0.0168 ms 34.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4141201Z SingleProcess AUTOTUNE benchmarking takes 0.0552 seconds and 0.0874 seconds precompiling for 6 choices 2025-12-04T09:54:17.4141241Z Autotune Choices Stats: 2025-12-04T09:54:17.4141618Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2177", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.4141657Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4141694Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4141739Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4141977Z triton_mm_2177 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4142214Z triton_mm_2176 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4142449Z triton_mm_2172 0.0063 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4142695Z triton_mm_2174 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4142927Z triton_mm_2175 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4143160Z triton_mm_2173 0.0071 ms 82.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4143285Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0780 seconds precompiling for 6 choices 2025-12-04T09:54:17.4143362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4143403Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4143459Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4143559Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4144055Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4144102Z graph_break [] 2025-12-04T09:54:17.4144146Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4144221Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:54:17.4144274Z Autotune Choices Stats: 2025-12-04T09:54:17.4144641Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2180", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4144679Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4144716Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4144760Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4145009Z triton_mm_2180 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4145244Z triton_mm_2179 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4145478Z triton_mm_2183 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4145710Z triton_mm_2181 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4145975Z triton_mm_2182 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4146211Z triton_mm_2178 0.0070 
ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4146355Z SingleProcess AUTOTUNE benchmarking takes 0.0431 seconds and 0.0667 seconds precompiling for 6 choices 2025-12-04T09:54:17.4146395Z Autotune Choices Stats: 2025-12-04T09:54:17.4146761Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2187", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4146800Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4146838Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4146885Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4147123Z triton_mm_2187 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4147356Z triton_mm_2184 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4147586Z triton_mm_2189 0.0064 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4147831Z triton_mm_2188 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4148078Z triton_mm_2185 0.0068 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4148316Z triton_mm_2186 0.0068 ms 85.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4148441Z SingleProcess AUTOTUNE benchmarking takes 0.0429 seconds and 0.0745 seconds precompiling for 6 choices 2025-12-04T09:54:17.4148480Z Autotune Choices Stats: 2025-12-04T09:54:17.4148861Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2190", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4148901Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4148939Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4148983Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4149219Z triton_mm_2190 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4149451Z triton_mm_2192 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4149686Z triton_mm_2193 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4149919Z triton_mm_2195 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4150165Z triton_mm_2194 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4150399Z triton_mm_2191 0.0073 ms 81.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4150526Z SingleProcess AUTOTUNE benchmarking takes 0.0433 seconds and 0.0805 seconds precompiling for 6 choices 2025-12-04T09:54:17.4150600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4150642Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4150699Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4150800Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4151388Z inductor [('triton_bundler_save_kernel', 168), ('benchmarking.InductorBenchmarker.benchmark_gpu', 23), ('async_compile_cache_miss', 15), ('benchmarking.InductorBenchmarker.benchmark', 15), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('async_compile_cache_hit', 3), ('pad_mm_bench', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4151436Z graph_break [] 2025-12-04T09:54:17.4151479Z aten_mm_info [('aten.mm_8_8_8', 1)] 
2025-12-04T09:54:17.4151553Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4151592Z Autotune Choices Stats: 2025-12-04T09:54:17.4151968Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2197", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4152006Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4152044Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4152088Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4152330Z triton_mm_2197 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4152573Z triton_mm_2196 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4152811Z triton_mm_2200 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4153045Z triton_mm_2198 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4153278Z triton_mm_2199 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4153513Z triton_mm_2201 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4153650Z SingleProcess AUTOTUNE benchmarking takes 0.0375 seconds and 0.0600 seconds precompiling for 6 choices 2025-12-04T09:54:17.4153690Z Autotune Choices Stats: 2025-12-04T09:54:17.4154056Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2205", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4154095Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4154133Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4154180Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4154419Z triton_mm_2205 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4154655Z triton_mm_2206 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4154888Z triton_mm_2202 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4155122Z triton_mm_2204 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4155377Z triton_mm_2207 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4155611Z triton_mm_2203 0.0063 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4155737Z SingleProcess AUTOTUNE benchmarking takes 0.0388 seconds and 0.0770 seconds precompiling for 6 choices 2025-12-04T09:54:17.4155810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4155852Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4155908Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4156052Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4156554Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4156591Z graph_break [] 2025-12-04T09:54:17.4156634Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4156708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4156748Z Autotune Choices Stats: 2025-12-04T09:54:17.4157115Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2211", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4157154Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4157191Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4157255Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4157491Z triton_mm_2211 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4157724Z triton_mm_2210 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4157956Z triton_mm_2208 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4158187Z triton_mm_2209 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4158427Z triton_mm_2213 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4158662Z triton_mm_2212 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4158787Z 
SingleProcess AUTOTUNE benchmarking takes 0.0370 seconds and 0.0729 seconds precompiling for 6 choices 2025-12-04T09:54:17.4158848Z Autotune Choices Stats: 2025-12-04T09:54:17.4159225Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2218", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.4159266Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4159303Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4159349Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4159586Z triton_mm_2218 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4159830Z triton_mm_2216 0.0057 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4160064Z triton_mm_2219 0.0060 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4160297Z triton_mm_2214 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4160534Z triton_mm_2215 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4160770Z triton_mm_2217 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4160895Z SingleProcess AUTOTUNE benchmarking takes 0.0384 seconds and 0.0768 seconds precompiling for 6 choices 2025-12-04T09:54:17.4160951Z Autotune Choices Stats: 2025-12-04T09:54:17.4161316Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2223", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.4161354Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4161391Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4161435Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4161674Z triton_mm_2223 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4161908Z triton_mm_2222 0.0057 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4162144Z triton_mm_2225 0.0057 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4162376Z triton_mm_2221 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4162621Z triton_mm_2220 0.0060 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4162880Z triton_mm_2224 0.0064 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4163006Z SingleProcess AUTOTUNE benchmarking takes 0.0390 seconds and 0.0763 seconds precompiling for 6 choices 2025-12-04T09:54:17.4163080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4163121Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4163178Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4163277Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4163784Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4163822Z graph_break [] 2025-12-04T09:54:17.4163864Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4163938Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4163977Z Autotune Choices Stats: 2025-12-04T09:54:17.4164342Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2228", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4164381Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4164419Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4164464Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4164703Z triton_mm_2228 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4164954Z triton_mm_2226 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4165186Z triton_mm_2230 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4165419Z triton_mm_2227 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4165652Z triton_mm_2229 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4165885Z triton_mm_2231 0.0063 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4166040Z 
SingleProcess AUTOTUNE benchmarking takes 0.0411 seconds and 0.0726 seconds precompiling for 6 choices 2025-12-04T09:54:17.4166080Z Autotune Choices Stats: 2025-12-04T09:54:17.4166446Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005640000104904175, "best_triton_pos": 0} 2025-12-04T09:54:17.4166505Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4166554Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4166605Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4166846Z triton_mm_2237 0.0056 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4167079Z triton_mm_2232 0.0058 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4167324Z triton_mm_2233 0.0060 ms 94.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4167561Z triton_mm_2236 0.0061 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4167795Z triton_mm_2235 0.0072 ms 77.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4168032Z triton_mm_2234 0.0074 ms 75.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4168161Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0753 seconds precompiling for 6 choices 2025-12-04T09:54:17.4168202Z Autotune Choices Stats: 2025-12-04T09:54:17.4168570Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2240", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4168625Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4168663Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4168708Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4168948Z triton_mm_2240 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4169182Z triton_mm_2243 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4169416Z triton_mm_2242 0.0065 ms 88.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4169651Z triton_mm_2241 0.0066 ms 87.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4169884Z triton_mm_2239 0.0155 ms 37.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4170130Z triton_mm_2238 0.0163 ms 35.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4170264Z SingleProcess AUTOTUNE benchmarking takes 0.0524 seconds and 0.0771 seconds precompiling for 6 choices 2025-12-04T09:54:17.4170340Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4170383Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4170440Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4170540Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4171047Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4171085Z graph_break [] 2025-12-04T09:54:17.4171128Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4171204Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4171247Z Autotune Choices Stats: 2025-12-04T09:54:17.4171620Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2244", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.4171660Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4171698Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4171744Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4171986Z triton_mm_2244 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4172234Z triton_mm_2246 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4172468Z triton_mm_2245 0.0058 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4172701Z triton_mm_2247 0.0059 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4172936Z triton_mm_2248 0.0060 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4173172Z triton_mm_2249 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4173299Z 
SingleProcess AUTOTUNE benchmarking takes 0.0410 seconds and 0.0742 seconds precompiling for 6 choices 2025-12-04T09:54:17.4173340Z Autotune Choices Stats: 2025-12-04T09:54:17.4173710Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2250", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4173761Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4173798Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4173847Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4174096Z triton_mm_2250 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4174333Z triton_mm_2252 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4174565Z triton_mm_2251 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4174808Z triton_mm_2254 0.0061 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4175045Z triton_mm_2255 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4175277Z triton_mm_2253 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4175405Z SingleProcess AUTOTUNE benchmarking takes 0.0405 seconds and 0.0721 seconds precompiling for 6 choices 2025-12-04T09:54:17.4175447Z Autotune Choices Stats: 2025-12-04T09:54:17.4175818Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2258", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005640000104904175, "best_triton_pos": 0} 2025-12-04T09:54:17.4175869Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4175906Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4175981Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4176221Z triton_mm_2258 0.0056 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4176455Z triton_mm_2256 0.0058 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4176687Z triton_mm_2257 0.0058 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4176921Z triton_mm_2260 0.0059 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4177153Z triton_mm_2261 0.0061 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4177387Z triton_mm_2259 0.0068 ms 82.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4177539Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0755 seconds precompiling for 6 choices 2025-12-04T09:54:17.4177617Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4177675Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4177735Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4177836Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4178337Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4178377Z graph_break [] 2025-12-04T09:54:17.4178433Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4178510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4178551Z Autotune Choices Stats: 2025-12-04T09:54:17.4178921Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2264", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4178959Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4178998Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4179042Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4179284Z triton_mm_2264 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4179519Z triton_mm_2265 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4179771Z triton_mm_2266 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4180003Z triton_mm_2267 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4180237Z triton_mm_2262 0.0154 ms 38.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4180473Z triton_mm_2263 0.0159 ms 37.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4180603Z 
SingleProcess AUTOTUNE benchmarking takes 0.0524 seconds and 0.0791 seconds precompiling for 6 choices 2025-12-04T09:54:17.4180644Z Autotune Choices Stats: 2025-12-04T09:54:17.4181008Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2272", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4181061Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4181098Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4181148Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4181388Z triton_mm_2272 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4181636Z triton_mm_2271 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4181870Z triton_mm_2273 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4182112Z triton_mm_2268 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4182344Z triton_mm_2269 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4182577Z triton_mm_2270 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4182704Z SingleProcess AUTOTUNE benchmarking takes 0.0409 seconds and 0.0739 seconds precompiling for 6 choices 2025-12-04T09:54:17.4182744Z Autotune Choices Stats: 2025-12-04T09:54:17.4183110Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2276", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4183149Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4183186Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4183244Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4183486Z triton_mm_2276 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4183720Z triton_mm_2275 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4183953Z triton_mm_2274 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4184189Z triton_mm_2278 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4184422Z triton_mm_2279 0.0063 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4184656Z triton_mm_2277 0.0065 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4184783Z SingleProcess AUTOTUNE benchmarking takes 0.0413 seconds and 0.0741 seconds precompiling for 6 choices 2025-12-04T09:54:17.4184871Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4184914Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4184972Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4185088Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4185585Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4185624Z graph_break [] 2025-12-04T09:54:17.4185668Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4185746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4185796Z Autotune Choices Stats: 2025-12-04T09:54:17.4186182Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2280", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006479999981820583, "best_triton_pos": 0} 2025-12-04T09:54:17.4186223Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4186261Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4186306Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4186544Z triton_mm_2280 0.0065 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4186778Z triton_mm_2282 0.0066 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4187012Z triton_mm_2285 0.0067 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4187272Z triton_mm_2281 0.0067 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4187507Z triton_mm_2284 0.0068 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4187742Z triton_mm_2283 0.0068 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4187868Z 
SingleProcess AUTOTUNE benchmarking takes 0.0430 seconds and 0.0788 seconds precompiling for 6 choices 2025-12-04T09:54:17.4187911Z Autotune Choices Stats: 2025-12-04T09:54:17.4188274Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2291", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006039000116288662, "best_triton_pos": 0} 2025-12-04T09:54:17.4188313Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4188354Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4188401Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4188640Z triton_mm_2291 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4188900Z triton_mm_2289 0.0061 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4189138Z triton_mm_2290 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4189372Z triton_mm_2288 0.0067 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4189618Z triton_mm_2287 0.0124 ms 48.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4189851Z triton_mm_2286 0.0164 ms 36.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4189981Z SingleProcess AUTOTUNE benchmarking takes 0.0563 seconds and 0.0781 seconds precompiling for 6 choices 2025-12-04T09:54:17.4190021Z Autotune Choices Stats: 2025-12-04T09:54:17.4190389Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2295", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4190431Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4190468Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4190513Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4190751Z triton_mm_2295 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4191001Z triton_mm_2297 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4191234Z triton_mm_2296 0.0060 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4191474Z triton_mm_2294 0.0096 ms 59.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4191710Z triton_mm_2292 0.0142 ms 40.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4191942Z triton_mm_2293 0.0169 ms 33.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4192067Z SingleProcess AUTOTUNE benchmarking takes 0.0555 seconds and 0.0815 seconds precompiling for 6 choices 2025-12-04T09:54:17.4192140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4192195Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4192253Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4192354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4192859Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4192899Z graph_break [] 2025-12-04T09:54:17.4192941Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4193015Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4193055Z Autotune Choices Stats: 2025-12-04T09:54:17.4193435Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2299", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T09:54:17.4193478Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4193515Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4193560Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4193799Z triton_mm_2299 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4194032Z triton_mm_2298 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4194268Z triton_mm_2303 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4194504Z triton_mm_2300 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4194745Z triton_mm_2302 0.0064 ms 91.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4194978Z triton_mm_2301 0.0065 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4195106Z 
SingleProcess AUTOTUNE benchmarking takes 0.0406 seconds and 0.0799 seconds precompiling for 6 choices 2025-12-04T09:54:17.4195147Z Autotune Choices Stats: 2025-12-04T09:54:17.4195523Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2305", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4195562Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4195601Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4195648Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4195886Z triton_mm_2305 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4196159Z triton_mm_2307 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4196408Z triton_mm_2306 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4196642Z triton_mm_2308 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4196877Z triton_mm_2309 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4197125Z triton_mm_2304 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4197253Z SingleProcess AUTOTUNE benchmarking takes 0.0410 seconds and 0.0740 seconds precompiling for 6 choices 2025-12-04T09:54:17.4197295Z Autotune Choices Stats: 2025-12-04T09:54:17.4197663Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2312", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4197702Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4197738Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4197785Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4198023Z triton_mm_2312 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4198263Z triton_mm_2314 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4198509Z triton_mm_2313 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4198739Z triton_mm_2311 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4198974Z triton_mm_2315 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4199206Z triton_mm_2310 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4199336Z SingleProcess AUTOTUNE benchmarking takes 0.0394 seconds and 0.0873 seconds precompiling for 6 choices 2025-12-04T09:54:17.4199410Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4199452Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4199509Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4199611Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4200131Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4200170Z graph_break [] 2025-12-04T09:54:17.4200215Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4200288Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4200328Z Autotune Choices Stats: 2025-12-04T09:54:17.4200705Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2317", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.4200745Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4200781Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4200829Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4201068Z triton_mm_2317 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4201303Z triton_mm_2320 0.0060 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4201541Z triton_mm_2316 0.0060 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4201779Z triton_mm_2321 0.0061 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4202026Z triton_mm_2318 0.0062 ms 91.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4202258Z triton_mm_2319 0.0064 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4202385Z 
SingleProcess AUTOTUNE benchmarking takes 0.0370 seconds and 0.0758 seconds precompiling for 6 choices 2025-12-04T09:54:17.4202425Z Autotune Choices Stats: 2025-12-04T09:54:17.4202793Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2326", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4202835Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4202874Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4202921Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4203160Z triton_mm_2326 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4203398Z triton_mm_2327 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4203643Z triton_mm_2323 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4203887Z triton_mm_2325 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4204118Z triton_mm_2324 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4204367Z triton_mm_2322 0.0134 ms 43.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4204493Z SingleProcess AUTOTUNE benchmarking takes 0.0469 seconds and 0.0793 seconds precompiling for 6 choices 2025-12-04T09:54:17.4204534Z Autotune Choices Stats: 2025-12-04T09:54:17.4204903Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2330", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4204944Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4204980Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4205025Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4205263Z triton_mm_2330 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4205504Z triton_mm_2331 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4205748Z triton_mm_2333 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4206011Z triton_mm_2332 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4206247Z triton_mm_2329 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4206479Z triton_mm_2328 0.0092 ms 65.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4206607Z SingleProcess AUTOTUNE benchmarking takes 0.0382 seconds and 0.0553 seconds precompiling for 6 choices 2025-12-04T09:54:17.4206680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4206722Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4206781Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4206881Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4207381Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4207434Z graph_break [] 2025-12-04T09:54:17.4207491Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4207566Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4207607Z Autotune Choices Stats: 2025-12-04T09:54:17.4207973Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2336", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4208016Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4208053Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4208113Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4208350Z triton_mm_2336 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4208586Z triton_mm_2337 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4208820Z triton_mm_2338 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4209055Z triton_mm_2339 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4209290Z triton_mm_2335 0.0080 ms 72.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4209539Z triton_mm_2334 0.0143 ms 40.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4209671Z 
SingleProcess AUTOTUNE benchmarking takes 0.0461 seconds and 0.0748 seconds precompiling for 6 choices 2025-12-04T09:54:17.4209711Z Autotune Choices Stats: 2025-12-04T09:54:17.4210081Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2343", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4210121Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4210159Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4210208Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4210445Z triton_mm_2343 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4210680Z triton_mm_2344 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4210913Z triton_mm_2345 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4211169Z triton_mm_2340 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4211404Z triton_mm_2341 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4211642Z triton_mm_2342 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4211767Z SingleProcess AUTOTUNE benchmarking takes 0.0380 seconds and 0.0791 seconds precompiling for 6 choices 2025-12-04T09:54:17.4211819Z Autotune Choices Stats: 2025-12-04T09:54:17.4212183Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2351", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4212222Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4212262Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4212307Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4212548Z triton_mm_2351 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4212782Z triton_mm_2348 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4213018Z triton_mm_2350 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4213265Z triton_mm_2349 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4213503Z triton_mm_2347 0.0064 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4213737Z triton_mm_2346 0.0065 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4213865Z SingleProcess AUTOTUNE benchmarking takes 0.0380 seconds and 0.0579 seconds precompiling for 6 choices 2025-12-04T09:54:17.4213941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4213983Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4214041Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4214142Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4214640Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4214687Z graph_break [] 2025-12-04T09:54:17.4214733Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4214807Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4214857Z Autotune Choices Stats: 2025-12-04T09:54:17.4215223Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2352", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4215263Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4215300Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4215347Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4215599Z triton_mm_2352 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4215841Z triton_mm_2356 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4216107Z triton_mm_2353 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4216339Z triton_mm_2354 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4216573Z triton_mm_2355 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4216808Z triton_mm_2357 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4216954Z 
SingleProcess AUTOTUNE benchmarking takes 0.0372 seconds and 0.0746 seconds precompiling for 6 choices 2025-12-04T09:54:17.4216993Z Autotune Choices Stats: 2025-12-04T09:54:17.4217359Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2358", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4217401Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4217439Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4217487Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4217727Z triton_mm_2358 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4217963Z triton_mm_2360 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4218195Z triton_mm_2363 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4218444Z triton_mm_2362 0.0060 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4218695Z triton_mm_2361 0.0063 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4218927Z triton_mm_2359 0.0079 ms 73.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4219056Z SingleProcess AUTOTUNE benchmarking takes 0.0376 seconds and 0.0759 seconds precompiling for 6 choices 2025-12-04T09:54:17.4219097Z Autotune Choices Stats: 2025-12-04T09:54:17.4219477Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2364", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4219517Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4219555Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4219599Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4219839Z triton_mm_2364 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4220071Z triton_mm_2368 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4220305Z triton_mm_2369 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4220537Z triton_mm_2366 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4220779Z triton_mm_2365 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4221011Z triton_mm_2367 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4221138Z SingleProcess AUTOTUNE benchmarking takes 0.0381 seconds and 0.0780 seconds precompiling for 6 choices 2025-12-04T09:54:17.4221214Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4221256Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4221317Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4221418Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4221921Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4221972Z graph_break [] 2025-12-04T09:54:17.4222015Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4222090Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4222133Z Autotune Choices Stats: 2025-12-04T09:54:17.4222515Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2370", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4222556Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4222594Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4222638Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4222878Z triton_mm_2370 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4223123Z triton_mm_2371 0.0067 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4223357Z triton_mm_2372 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4223590Z triton_mm_2373 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4223828Z triton_mm_2375 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4224064Z triton_mm_2374 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4224201Z 
SingleProcess AUTOTUNE benchmarking takes 0.0414 seconds and 0.0794 seconds precompiling for 6 choices 2025-12-04T09:54:17.4224244Z Autotune Choices Stats: 2025-12-04T09:54:17.4224609Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2380", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.0061599998734891415, "best_triton_pos": 0} 2025-12-04T09:54:17.4224649Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4224687Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4224735Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4224973Z triton_mm_2380 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4225213Z triton_mm_2376 0.0062 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4225444Z triton_mm_2379 0.0066 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4225678Z triton_mm_2377 0.0074 ms 83.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4225955Z triton_mm_2381 0.0074 ms 82.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4226203Z triton_mm_2378 0.0080 ms 76.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4226331Z SingleProcess AUTOTUNE benchmarking takes 0.5179 seconds and 0.0759 seconds precompiling for 6 choices 2025-12-04T09:54:17.4226370Z Autotune Choices Stats: 2025-12-04T09:54:17.4226749Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2382", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006200000178068876, "best_triton_pos": 0} 2025-12-04T09:54:17.4226789Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4226827Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4226871Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4227112Z triton_mm_2382 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4227347Z triton_mm_2385 0.0066 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4227582Z triton_mm_2384 0.0067 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4227818Z triton_mm_2386 0.0068 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4228064Z triton_mm_2387 0.0069 ms 89.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4228297Z triton_mm_2383 0.0072 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4228422Z SingleProcess AUTOTUNE benchmarking takes 0.0403 seconds and 0.0771 seconds precompiling for 6 choices 2025-12-04T09:54:17.4228497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4228539Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4228601Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4228702Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4229201Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4229245Z graph_break [] 2025-12-04T09:54:17.4229287Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4229363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4229417Z Autotune Choices Stats: 2025-12-04T09:54:17.4229803Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2392", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.0061599998734891415, "best_triton_pos": 0} 2025-12-04T09:54:17.4229843Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4229881Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4229925Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4230162Z triton_mm_2392 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4230395Z triton_mm_2390 0.0067 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4230642Z triton_mm_2393 0.0067 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4230879Z triton_mm_2391 0.0068 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4231112Z triton_mm_2389 0.0068 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4231346Z triton_mm_2388 0.0071 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4231472Z 
SingleProcess AUTOTUNE benchmarking takes 0.0414 seconds and 0.0737 seconds precompiling for 6 choices 2025-12-04T09:54:17.4231514Z Autotune Choices Stats: 2025-12-04T09:54:17.4231892Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2399", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005998999811708927, "best_triton_pos": 0} 2025-12-04T09:54:17.4231933Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4231970Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4232018Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4232256Z triton_mm_2399 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4232489Z triton_mm_2394 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4232724Z triton_mm_2396 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4232954Z triton_mm_2395 0.0067 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4233189Z triton_mm_2398 0.0067 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4233448Z triton_mm_2397 0.0069 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4233577Z SingleProcess AUTOTUNE benchmarking takes 0.0406 seconds and 0.0782 seconds precompiling for 6 choices 2025-12-04T09:54:17.4233618Z Autotune Choices Stats: 2025-12-04T09:54:17.4233985Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2403", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4234025Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4234062Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4234119Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4234356Z triton_mm_2403 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4234594Z triton_mm_2405 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4234825Z triton_mm_2400 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4235058Z triton_mm_2404 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4235291Z triton_mm_2402 0.0067 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4235534Z triton_mm_2401 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4235660Z SingleProcess AUTOTUNE benchmarking takes 0.0423 seconds and 0.0782 seconds precompiling for 6 choices 2025-12-04T09:54:17.4235733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4235778Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4235837Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4235966Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4236464Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4236504Z graph_break [] 2025-12-04T09:54:17.4236548Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4236623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4236663Z Autotune Choices Stats: 2025-12-04T09:54:17.4237032Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2409", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4237084Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4237125Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4238806Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4239051Z triton_mm_2409 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4239289Z triton_mm_2411 0.0064 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4239540Z triton_mm_2408 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4239776Z triton_mm_2410 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4240009Z triton_mm_2406 0.0157 ms 37.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4240241Z triton_mm_2407 0.0168 ms 35.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4240367Z 
SingleProcess AUTOTUNE benchmarking takes 0.0531 seconds and 0.0771 seconds precompiling for 6 choices 2025-12-04T09:54:17.4240410Z Autotune Choices Stats: 2025-12-04T09:54:17.4240774Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2417", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006279999855905771, "best_triton_pos": 0} 2025-12-04T09:54:17.4240830Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4240868Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4240916Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4241154Z triton_mm_2417 0.0063 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4241393Z triton_mm_2416 0.0066 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4241629Z triton_mm_2413 0.0067 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4241862Z triton_mm_2414 0.0067 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4242095Z triton_mm_2412 0.0071 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4242340Z triton_mm_2415 0.0074 ms 84.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4242469Z SingleProcess AUTOTUNE benchmarking takes 0.0436 seconds and 0.0764 seconds precompiling for 6 choices 2025-12-04T09:54:17.4242521Z Autotune Choices Stats: 2025-12-04T09:54:17.4242886Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2422", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00595899997279048, "best_triton_pos": 0} 2025-12-04T09:54:17.4242926Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4242962Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4243009Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4243253Z triton_mm_2422 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4243501Z triton_mm_2423 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4243737Z triton_mm_2421 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4243969Z triton_mm_2418 0.0109 ms 54.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4244203Z triton_mm_2420 0.0153 ms 38.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4244437Z triton_mm_2419 0.0170 ms 35.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4244577Z SingleProcess AUTOTUNE benchmarking takes 0.0526 seconds and 0.0826 seconds precompiling for 6 choices 2025-12-04T09:54:17.4244649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4244691Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4244747Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4244849Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4245353Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4245395Z graph_break [] 2025-12-04T09:54:17.4245437Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4245510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4245551Z Autotune Choices Stats: 2025-12-04T09:54:17.4245919Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2424", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919000133872032, "best_triton_pos": 0} 2025-12-04T09:54:17.4246002Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4246039Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4246085Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4246334Z triton_mm_2424 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4246572Z triton_mm_2427 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4246805Z triton_mm_2426 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4247057Z triton_mm_2425 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4247290Z triton_mm_2429 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4247528Z triton_mm_2428 0.0069 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4247657Z 
SingleProcess AUTOTUNE benchmarking takes 0.0408 seconds and 0.0719 seconds precompiling for 6 choices 2025-12-04T09:54:17.4247696Z Autotune Choices Stats: 2025-12-04T09:54:17.4248068Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2433", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4248126Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4248164Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4248211Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4248447Z triton_mm_2433 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4248681Z triton_mm_2435 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4248913Z triton_mm_2430 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4249146Z triton_mm_2431 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4249383Z triton_mm_2434 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4249620Z triton_mm_2432 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4249755Z SingleProcess AUTOTUNE benchmarking takes 0.0417 seconds and 0.0631 seconds precompiling for 6 choices 2025-12-04T09:54:17.4249796Z Autotune Choices Stats: 2025-12-04T09:54:17.4250173Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2439", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.4250215Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4250253Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4250298Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4250537Z triton_mm_2439 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4250783Z triton_mm_2436 0.0064 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4251019Z triton_mm_2441 0.0067 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4251253Z triton_mm_2438 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4251492Z triton_mm_2440 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4251725Z triton_mm_2437 0.0070 ms 85.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4251851Z SingleProcess AUTOTUNE benchmarking takes 0.0416 seconds and 0.0744 seconds precompiling for 6 choices 2025-12-04T09:54:17.4251938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4251980Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4252036Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4252138Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4252640Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4252678Z graph_break [] 2025-12-04T09:54:17.4252723Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4252800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4252841Z Autotune Choices Stats: 2025-12-04T09:54:17.4253206Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2446", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4253247Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4253283Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4253339Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4253585Z triton_mm_2446 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4253833Z triton_mm_2445 0.0066 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4254068Z triton_mm_2447 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4254302Z triton_mm_2444 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4254549Z triton_mm_2443 0.0118 ms 50.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4254784Z triton_mm_2442 0.0142 ms 42.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4254912Z 
SingleProcess AUTOTUNE benchmarking takes 0.0546 seconds and 0.0777 seconds precompiling for 6 choices 2025-12-04T09:54:17.4254952Z Autotune Choices Stats: 2025-12-04T09:54:17.4255321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2451", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006560000125318766, "best_triton_pos": 0} 2025-12-04T09:54:17.4255361Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4255400Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4255446Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4255702Z triton_mm_2451 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4255961Z triton_mm_2450 0.0067 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4256197Z triton_mm_2453 0.0067 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4256435Z triton_mm_2452 0.0068 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4256668Z triton_mm_2449 0.0068 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4256902Z triton_mm_2448 0.0118 ms 55.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4257027Z SingleProcess AUTOTUNE benchmarking takes 0.0454 seconds and 0.0716 seconds precompiling for 6 choices 2025-12-04T09:54:17.4257081Z Autotune Choices Stats: 2025-12-04T09:54:17.4257465Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2458", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4257506Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4257543Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4257590Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4257829Z triton_mm_2458 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4258061Z triton_mm_2459 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4258308Z triton_mm_2456 0.0062 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4258544Z triton_mm_2457 0.0067 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4258780Z triton_mm_2455 0.0098 ms 60.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4259010Z triton_mm_2454 0.0156 ms 37.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4259138Z SingleProcess AUTOTUNE benchmarking takes 0.0651 seconds and 0.0765 seconds precompiling for 6 choices 2025-12-04T09:54:17.4259213Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4259267Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4259325Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4259427Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4259928Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4259966Z graph_break [] 2025-12-04T09:54:17.4260010Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4260084Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4260124Z Autotune Choices Stats: 2025-12-04T09:54:17.4260492Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2462", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4260530Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4260568Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4260617Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4260856Z triton_mm_2462 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4261100Z triton_mm_2464 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4261347Z triton_mm_2465 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4261582Z triton_mm_2463 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4261830Z triton_mm_2461 0.0144 ms 41.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4262063Z triton_mm_2460 0.0165 ms 36.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4262193Z 
SingleProcess AUTOTUNE benchmarking takes 0.0465 seconds and 0.0740 seconds precompiling for 6 choices 2025-12-04T09:54:17.4262233Z Autotune Choices Stats: 2025-12-04T09:54:17.4262601Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2471", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4262640Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4262677Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4262725Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4262964Z triton_mm_2471 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4263208Z triton_mm_2468 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4263439Z triton_mm_2466 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4263679Z triton_mm_2467 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4263912Z triton_mm_2469 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4264148Z triton_mm_2470 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4264278Z SingleProcess AUTOTUNE benchmarking takes 0.0371 seconds and 0.0789 seconds precompiling for 6 choices 2025-12-04T09:54:17.4264319Z Autotune Choices Stats: 2025-12-04T09:54:17.4264689Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2474", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4264739Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4264778Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4264835Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4265073Z triton_mm_2474 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4265307Z triton_mm_2476 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4265555Z triton_mm_2472 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4265786Z triton_mm_2473 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4266049Z triton_mm_2475 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4266286Z triton_mm_2477 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4266412Z SingleProcess AUTOTUNE benchmarking takes 0.0378 seconds and 0.0777 seconds precompiling for 6 choices 2025-12-04T09:54:17.4266490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4266531Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4266588Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4266702Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4267198Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4267236Z graph_break [] 2025-12-04T09:54:17.4267283Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4267358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4267402Z Autotune Choices Stats: 2025-12-04T09:54:17.4267774Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2480", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4267815Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4267854Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4267899Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4268142Z triton_mm_2480 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4268375Z triton_mm_2478 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4268631Z triton_mm_2483 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4268865Z triton_mm_2481 0.0065 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4269101Z triton_mm_2479 0.0067 ms 88.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4269347Z triton_mm_2482 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4269475Z 
SingleProcess AUTOTUNE benchmarking takes 0.0363 seconds and 0.0757 seconds precompiling for 6 choices 2025-12-04T09:54:17.4269519Z Autotune Choices Stats: 2025-12-04T09:54:17.4269889Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2484", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006039999891072512, "best_triton_pos": 0} 2025-12-04T09:54:17.4269928Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4269964Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4270013Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4270250Z triton_mm_2484 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4270483Z triton_mm_2486 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4270726Z triton_mm_2488 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4270960Z triton_mm_2487 0.0067 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4271194Z triton_mm_2489 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4271428Z triton_mm_2485 0.0134 ms 45.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4271556Z SingleProcess AUTOTUNE benchmarking takes 0.0480 seconds and 0.0755 seconds precompiling for 6 choices 2025-12-04T09:54:17.4271597Z Autotune Choices Stats: 2025-12-04T09:54:17.4271967Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2493", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.0062790000811219215, "best_triton_pos": 0} 2025-12-04T09:54:17.4272015Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4272057Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4272103Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4272350Z triton_mm_2493 0.0063 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4272583Z triton_mm_2495 0.0068 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4272815Z triton_mm_2492 0.0068 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4273060Z triton_mm_2494 0.0068 ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4273291Z triton_mm_2491 0.0124 ms 50.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4273529Z triton_mm_2490 0.0156 ms 40.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4273656Z SingleProcess AUTOTUNE benchmarking takes 0.0512 seconds and 0.0803 seconds precompiling for 6 choices 2025-12-04T09:54:17.4273731Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4273773Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4273830Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4273931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4274430Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4274481Z graph_break [] 2025-12-04T09:54:17.4274523Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4274599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4274638Z Autotune Choices Stats: 2025-12-04T09:54:17.4275007Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2498", "best_kernel_desc": "ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4275046Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4275086Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4275130Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4275368Z triton_mm_2498 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4275604Z triton_mm_2496 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4275852Z triton_mm_2497 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4276140Z triton_mm_2499 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4276374Z triton_mm_2500 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4276607Z triton_mm_2501 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4276751Z 
SingleProcess AUTOTUNE benchmarking takes 0.0366 seconds and 0.0739 seconds precompiling for 6 choices 2025-12-04T09:54:17.4276792Z Autotune Choices Stats: 2025-12-04T09:54:17.4277161Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2507", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4277202Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4277240Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4279112Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4279356Z triton_mm_2507 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4279610Z triton_mm_2504 0.0065 ms 91.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4279845Z triton_mm_2505 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4280108Z triton_mm_2506 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4280341Z triton_mm_2502 0.0122 ms 48.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4280574Z triton_mm_2503 0.0160 ms 37.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4280704Z SingleProcess AUTOTUNE benchmarking takes 0.0479 seconds and 0.0814 seconds precompiling for 6 choices 2025-12-04T09:54:17.4280747Z Autotune Choices Stats: 2025-12-04T09:54:17.4281113Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2509", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006039999891072512, "best_triton_pos": 0} 2025-12-04T09:54:17.4281151Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4281189Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4281249Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4281493Z triton_mm_2509 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4281737Z triton_mm_2513 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4281972Z triton_mm_2510 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4282206Z triton_mm_2508 0.0063 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4282449Z triton_mm_2512 0.0067 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4282684Z triton_mm_2511 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4282809Z SingleProcess AUTOTUNE benchmarking takes 0.0710 seconds and 0.0810 seconds precompiling for 6 choices 2025-12-04T09:54:17.4282885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4282928Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4282986Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4283087Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4283623Z inductor [('triton_bundler_save_kernel', 160), ('benchmarking.InductorBenchmarker.benchmark_gpu', 22), ('benchmarking.InductorBenchmarker.benchmark', 14), ('async_compile_cache_miss', 12), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4283673Z graph_break [] 2025-12-04T09:54:17.4283717Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4283791Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4283830Z Autotune Choices Stats: 2025-12-04T09:54:17.4284201Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2518", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.4284242Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4284280Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4284327Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4284568Z triton_mm_2518 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4284801Z triton_mm_2519 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4285037Z triton_mm_2514 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4285279Z triton_mm_2517 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4285528Z triton_mm_2515 0.0062 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4285763Z triton_mm_2516 0.0068 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4285889Z SingleProcess AUTOTUNE benchmarking takes 0.0424 seconds and 0.0719 seconds precompiling for 6 choices 2025-12-04T09:54:17.4285961Z Autotune Choices Stats: 2025-12-04T09:54:17.4286343Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2523", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006639999803155661, "best_triton_pos": 0} 2025-12-04T09:54:17.4286385Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4286422Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4286472Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4286709Z triton_mm_2523 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4286945Z triton_mm_2525 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4287182Z triton_mm_2524 0.0068 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4287426Z triton_mm_2521 0.0152 ms 43.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4287662Z triton_mm_2522 0.0167 ms 39.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4287896Z triton_mm_2520 0.0170 ms 39.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4288024Z SingleProcess AUTOTUNE benchmarking takes 0.0544 seconds and 0.0849 seconds precompiling for 6 choices 2025-12-04T09:54:17.4288063Z Autotune Choices Stats: 2025-12-04T09:54:17.4288432Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2526", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4288472Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4288508Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4288554Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4288791Z triton_mm_2526 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4289050Z triton_mm_2527 0.0076 ms 77.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4289282Z triton_mm_2528 0.0077 ms 77.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4289515Z triton_mm_2529 0.0084 ms 70.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, 
GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4289760Z triton_mm_2530 0.0085 ms 69.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4289995Z triton_mm_2531 0.0099 ms 59.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4290124Z SingleProcess AUTOTUNE benchmarking takes 0.5292 seconds and 0.0776 seconds precompiling for 6 choices 2025-12-04T09:54:17.4290198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4290241Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4290297Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4290400Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4290901Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4290954Z graph_break [] 2025-12-04T09:54:17.4290996Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4291071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4291110Z Autotune Choices Stats: 2025-12-04T09:54:17.4291478Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2537", "best_kernel_desc": 
"ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006639999803155661, "best_triton_pos": 0} 2025-12-04T09:54:17.4291518Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4291557Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4291604Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4291845Z triton_mm_2537 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4292081Z triton_mm_2536 0.0070 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4292316Z triton_mm_2534 0.0072 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4292549Z triton_mm_2535 0.0073 ms 91.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4292814Z triton_mm_2532 0.0074 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4293050Z triton_mm_2533 0.0074 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 
2025-12-04T09:54:17.4293176Z SingleProcess AUTOTUNE benchmarking takes 0.0472 seconds and 0.0772 seconds precompiling for 6 choices 2025-12-04T09:54:17.4293219Z Autotune Choices Stats: 2025-12-04T09:54:17.4293595Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2541", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006519999820739031, "best_triton_pos": 0} 2025-12-04T09:54:17.4293636Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4293675Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4293723Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4293962Z triton_mm_2541 0.0065 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4294196Z triton_mm_2542 0.0067 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4294429Z triton_mm_2539 0.0067 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4294662Z triton_mm_2540 0.0067 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4294910Z triton_mm_2543 0.0068 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4295142Z triton_mm_2538 0.0073 ms 89.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4295268Z SingleProcess AUTOTUNE benchmarking takes 0.0449 seconds and 0.0773 seconds precompiling for 6 choices 2025-12-04T09:54:17.4295310Z Autotune Choices Stats: 2025-12-04T09:54:17.4295679Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2546", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T09:54:17.4295720Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4295758Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4295804Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4296058Z triton_mm_2546 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4296309Z triton_mm_2548 0.0066 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4296555Z triton_mm_2549 0.0067 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4296790Z triton_mm_2547 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4297025Z triton_mm_2545 0.0117 ms 51.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4297268Z triton_mm_2544 0.0168 ms 35.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4297395Z SingleProcess AUTOTUNE benchmarking takes 0.0586 seconds and 0.0835 seconds precompiling for 6 choices 2025-12-04T09:54:17.4297470Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4297514Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4297571Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4297672Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4298171Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4298211Z graph_break [] 2025-12-04T09:54:17.4298253Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4298329Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4298380Z Autotune Choices Stats: 2025-12-04T09:54:17.4298752Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2550", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4298793Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4298829Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4298874Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4299113Z triton_mm_2550 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4299348Z triton_mm_2552 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4299582Z triton_mm_2555 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4299817Z triton_mm_2553 0.0065 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4300061Z triton_mm_2554 0.0067 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4300306Z triton_mm_2551 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4300435Z SingleProcess AUTOTUNE benchmarking takes 0.0409 seconds and 0.0731 seconds precompiling for 6 choices 2025-12-04T09:54:17.4300475Z Autotune Choices Stats: 2025-12-04T09:54:17.4300842Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2560", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4300890Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4300929Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4300977Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4301217Z triton_mm_2560 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4301450Z triton_mm_2559 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4301686Z triton_mm_2561 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4301927Z triton_mm_2558 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4302161Z triton_mm_2557 0.0166 ms 35.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4302406Z triton_mm_2556 0.0178 ms 33.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4302533Z SingleProcess AUTOTUNE benchmarking takes 0.0554 seconds and 0.0797 seconds precompiling for 6 choices 2025-12-04T09:54:17.4302573Z Autotune Choices Stats: 2025-12-04T09:54:17.4302940Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2566", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005998999811708927, "best_triton_pos": 0} 2025-12-04T09:54:17.4302981Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4303018Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4303064Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4303303Z triton_mm_2566 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4303540Z triton_mm_2565 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4303786Z triton_mm_2567 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4304028Z triton_mm_2564 0.0068 ms 87.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4304261Z triton_mm_2562 0.0094 ms 64.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4304492Z triton_mm_2563 0.0132 ms 45.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4304630Z SingleProcess AUTOTUNE benchmarking takes 0.0566 seconds and 0.0794 seconds precompiling for 6 choices 2025-12-04T09:54:17.4304702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4304745Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4304803Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4304905Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4305402Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4305441Z graph_break [] 2025-12-04T09:54:17.4305484Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4305558Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4305598Z Autotune Choices Stats: 2025-12-04T09:54:17.4305992Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2573", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4306060Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4306097Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4306142Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4306380Z triton_mm_2573 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4306614Z triton_mm_2571 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4306850Z triton_mm_2572 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4307083Z triton_mm_2570 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4307317Z triton_mm_2568 0.0127 ms 47.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4307563Z triton_mm_2569 0.0163 ms 36.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4307706Z SingleProcess AUTOTUNE benchmarking takes 0.0554 seconds and 0.0784 seconds precompiling for 6 choices 2025-12-04T09:54:17.4307746Z Autotune Choices Stats: 2025-12-04T09:54:17.4308117Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2577", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006560000125318766, "best_triton_pos": 0} 2025-12-04T09:54:17.4308155Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4308194Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4308241Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4308491Z triton_mm_2577 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4308723Z triton_mm_2576 0.0066 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4308956Z triton_mm_2578 0.0067 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4309189Z triton_mm_2579 0.0068 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4309423Z triton_mm_2575 0.0139 ms 47.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4309673Z triton_mm_2574 0.0160 ms 41.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4309800Z SingleProcess AUTOTUNE benchmarking takes 0.0562 seconds and 0.0825 seconds precompiling for 6 choices 2025-12-04T09:54:17.4309840Z Autotune Choices Stats: 2025-12-04T09:54:17.4310206Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2584", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006519999820739031, "best_triton_pos": 0} 2025-12-04T09:54:17.4310246Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4310282Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4310327Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4310567Z triton_mm_2584 0.0065 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4310800Z triton_mm_2583 0.0067 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4311032Z triton_mm_2581 0.0067 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4311275Z triton_mm_2582 0.0067 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4311519Z triton_mm_2585 0.0068 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4311750Z triton_mm_2580 0.0072 ms 90.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4311876Z SingleProcess AUTOTUNE benchmarking takes 0.0437 seconds and 0.0791 seconds precompiling for 6 choices 2025-12-04T09:54:17.4311951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4312002Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4312061Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4312161Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4312663Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4312700Z graph_break [] 2025-12-04T09:54:17.4312743Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4312816Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4312857Z Autotune Choices Stats: 2025-12-04T09:54:17.4313228Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2587", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4313280Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4313317Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4313362Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4313598Z triton_mm_2587 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4313838Z triton_mm_2586 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4314074Z triton_mm_2591 0.0059 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4314306Z triton_mm_2588 0.0060 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4314538Z triton_mm_2589 0.0063 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4314773Z triton_mm_2590 0.0073 ms 78.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.4314909Z SingleProcess AUTOTUNE benchmarking takes 0.0432 seconds and 0.0593 seconds precompiling for 6 choices 2025-12-04T09:54:17.4314949Z Autotune Choices Stats: 2025-12-04T09:54:17.4315322Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2597", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571900000795722, "best_triton_pos": 0} 2025-12-04T09:54:17.4315361Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4315399Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4315446Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4315695Z triton_mm_2597 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4315963Z triton_mm_2592 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4316196Z triton_mm_2596 0.0058 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4316429Z triton_mm_2595 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4316661Z triton_mm_2594 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4316895Z triton_mm_2593 0.0064 ms 89.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4317034Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0716 seconds precompiling for 6 choices 2025-12-04T09:54:17.4317074Z Autotune Choices Stats: 2025-12-04T09:54:17.4317441Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2602", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.0063599999994039536, "best_triton_pos": 0} 2025-12-04T09:54:17.4317480Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4317517Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4317563Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4317803Z triton_mm_2602 0.0064 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4318039Z triton_mm_2603 0.0067 ms 95.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4318272Z triton_mm_2600 0.0067 ms 94.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4318504Z triton_mm_2601 0.0067 ms 94.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4318762Z triton_mm_2599 0.0070 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4318994Z triton_mm_2598 0.0072 ms 88.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4319120Z SingleProcess AUTOTUNE benchmarking takes 0.0444 seconds and 0.0824 seconds precompiling for 6 choices 2025-12-04T09:54:17.4319193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4319235Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4319293Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4319413Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4319910Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4319947Z graph_break [] 2025-12-04T09:54:17.4319991Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4320064Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4320105Z Autotune Choices Stats: 2025-12-04T09:54:17.4320474Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2608", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4320514Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4320563Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4320610Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4320849Z triton_mm_2608 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4321082Z triton_mm_2604 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4321318Z triton_mm_2607 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4321552Z triton_mm_2605 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4321789Z triton_mm_2609 0.0069 ms 83.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4322021Z triton_mm_2606 0.0075 ms 77.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4322160Z SingleProcess AUTOTUNE benchmarking takes 0.0463 seconds and 0.0805 seconds precompiling for 6 choices 2025-12-04T09:54:17.4322200Z Autotune Choices Stats: 2025-12-04T09:54:17.4322575Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2614", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.006039000116288662, "best_triton_pos": 0} 2025-12-04T09:54:17.4322615Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4322652Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4322700Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4322935Z triton_mm_2614 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4323180Z triton_mm_2611 0.0061 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4323412Z triton_mm_2615 0.0061 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4323647Z triton_mm_2613 0.0062 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4323881Z triton_mm_2612 0.0065 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4324114Z triton_mm_2610 0.0084 ms 72.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4324239Z SingleProcess AUTOTUNE benchmarking takes 0.0389 seconds and 0.0828 seconds precompiling for 6 choices 2025-12-04T09:54:17.4324294Z Autotune Choices Stats: 2025-12-04T09:54:17.4324657Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2618", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T09:54:17.4324695Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4324733Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4324777Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4325015Z triton_mm_2618 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4325247Z triton_mm_2619 0.0058 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4325482Z triton_mm_2621 0.0062 ms 92.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4325717Z triton_mm_2620 0.0064 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4325989Z triton_mm_2617 0.0144 ms 39.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4326234Z triton_mm_2616 0.0156 ms 36.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4326361Z SingleProcess AUTOTUNE benchmarking takes 0.0549 seconds and 0.0827 seconds precompiling for 6 choices 2025-12-04T09:54:17.4326433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4326475Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4326535Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4326635Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4327148Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4327186Z graph_break [] 2025-12-04T09:54:17.4327228Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4327302Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4327342Z Autotune Choices Stats: 2025-12-04T09:54:17.4327708Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2623", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4327748Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4327786Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4327831Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4328072Z triton_mm_2623 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4328319Z triton_mm_2624 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4328550Z triton_mm_2625 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4328786Z triton_mm_2626 0.0060 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4329020Z triton_mm_2622 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4329253Z triton_mm_2627 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4329378Z SingleProcess AUTOTUNE benchmarking takes 0.0410 seconds and 0.0811 seconds precompiling for 6 choices 2025-12-04T09:54:17.4329419Z Autotune Choices Stats: 2025-12-04T09:54:17.4329799Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2630", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005758999846875668, "best_triton_pos": 0} 2025-12-04T09:54:17.4329850Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4329887Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4329935Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4330171Z triton_mm_2630 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4330406Z triton_mm_2631 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4330654Z triton_mm_2633 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4330888Z triton_mm_2632 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4331120Z triton_mm_2628 0.0111 ms 52.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4331352Z triton_mm_2629 0.0150 ms 38.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4331480Z SingleProcess AUTOTUNE benchmarking takes 0.0556 seconds and 0.0734 seconds precompiling for 6 choices 2025-12-04T09:54:17.4331520Z Autotune Choices Stats: 2025-12-04T09:54:17.4331895Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2636", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005640000104904175, "best_triton_pos": 0} 2025-12-04T09:54:17.4331946Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4331985Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4332029Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4332265Z triton_mm_2636 0.0056 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4332499Z triton_mm_2634 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4332734Z triton_mm_2639 0.0058 ms 97.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4332967Z triton_mm_2638 0.0062 ms 91.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4333198Z triton_mm_2635 0.0066 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4333443Z triton_mm_2637 0.0068 ms 83.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4333587Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0812 seconds precompiling for 6 choices 2025-12-04T09:54:17.4333661Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4333703Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4333759Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4333861Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4334366Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4334405Z graph_break [] 2025-12-04T09:54:17.4334449Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4334525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4334564Z Autotune Choices Stats: 2025-12-04T09:54:17.4334932Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2643", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4334970Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4335010Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4335055Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4335295Z triton_mm_2643 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4335536Z triton_mm_2642 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4335771Z triton_mm_2644 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4336036Z triton_mm_2645 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4336269Z triton_mm_2640 0.0121 ms 47.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4336507Z triton_mm_2641 0.0146 ms 39.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4336633Z SingleProcess AUTOTUNE benchmarking takes 0.0540 seconds and 0.0771 seconds precompiling for 6 choices 2025-12-04T09:54:17.4336673Z Autotune Choices Stats: 2025-12-04T09:54:17.4337040Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2647", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T09:54:17.4337094Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4337131Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4337180Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4337429Z triton_mm_2647 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4337660Z triton_mm_2646 0.0058 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4337905Z triton_mm_2649 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4338140Z triton_mm_2651 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4338374Z triton_mm_2648 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4338606Z triton_mm_2650 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4338732Z SingleProcess AUTOTUNE benchmarking takes 0.0400 seconds and 0.0889 seconds precompiling for 6 choices 2025-12-04T09:54:17.4338774Z Autotune Choices Stats: 2025-12-04T09:54:17.4339140Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2657", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4339191Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4339228Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4339275Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4339511Z triton_mm_2657 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4339745Z triton_mm_2655 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4339979Z triton_mm_2654 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4340213Z triton_mm_2656 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4340445Z triton_mm_2652 0.0118 ms 50.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4340676Z triton_mm_2653 0.0154 ms 38.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4340814Z SingleProcess AUTOTUNE benchmarking takes 0.0538 seconds and 0.0790 seconds precompiling for 6 choices 2025-12-04T09:54:17.4340897Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4340942Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4340999Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4341101Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4341599Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4341650Z graph_break [] 2025-12-04T09:54:17.4341693Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4341768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4341810Z Autotune Choices Stats: 2025-12-04T09:54:17.4342182Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2660", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T09:54:17.4342221Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4342259Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4342302Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4342544Z triton_mm_2660 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4342780Z triton_mm_2661 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4343026Z triton_mm_2662 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4343260Z triton_mm_2663 0.0059 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4343493Z triton_mm_2658 0.0060 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4343727Z triton_mm_2659 0.0067 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=1 2025-12-04T09:54:17.4343855Z SingleProcess AUTOTUNE benchmarking takes 0.0402 seconds and 0.0759 seconds precompiling for 6 choices 2025-12-04T09:54:17.4343894Z Autotune Choices Stats: 2025-12-04T09:54:17.4344261Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2667", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T09:54:17.4344310Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4344349Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4344396Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4344649Z triton_mm_2667 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4344884Z triton_mm_2668 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4345118Z triton_mm_2669 0.0067 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4345359Z triton_mm_2665 0.0117 ms 50.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4345594Z triton_mm_2664 0.0146 ms 40.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4345831Z triton_mm_2666 0.0150 ms 39.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4345987Z SingleProcess AUTOTUNE benchmarking takes 0.0548 seconds and 0.0917 seconds precompiling for 6 choices 2025-12-04T09:54:17.4346027Z Autotune Choices Stats: 2025-12-04T09:54:17.4346397Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2674", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005878999829292297, "best_triton_pos": 0} 2025-12-04T09:54:17.4346437Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4346492Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4346540Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4346777Z triton_mm_2674 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4347011Z triton_mm_2673 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4347247Z triton_mm_2675 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4347482Z triton_mm_2672 0.0080 ms 73.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4347715Z triton_mm_2671 0.0153 ms 38.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4347948Z triton_mm_2670 0.0166 ms 35.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4348091Z SingleProcess AUTOTUNE benchmarking takes 0.0540 seconds and 0.0817 seconds precompiling for 6 choices 2025-12-04T09:54:17.4348166Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4348211Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4348268Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4348384Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4348881Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4348919Z graph_break [] 2025-12-04T09:54:17.4348964Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4349049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4349090Z Autotune Choices Stats: 2025-12-04T09:54:17.4349458Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2677", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T09:54:17.4349497Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4349534Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4349578Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4349816Z triton_mm_2677 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4350056Z triton_mm_2678 0.0063 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4350291Z triton_mm_2676 0.0066 ms 86.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4350536Z triton_mm_2681 0.0071 ms 81.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4350768Z triton_mm_2679 0.0077 ms 74.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4351003Z triton_mm_2680 0.0082 ms 70.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4351130Z SingleProcess AUTOTUNE benchmarking takes 0.5566 seconds and 0.0757 seconds precompiling for 6 choices 2025-12-04T09:54:17.4351170Z Autotune Choices Stats: 2025-12-04T09:54:17.4351538Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2684", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.006560000125318766, "best_triton_pos": 0} 2025-12-04T09:54:17.4351575Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4351613Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4351659Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4351911Z triton_mm_2684 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4352156Z triton_mm_2685 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4352389Z triton_mm_2687 0.0066 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4352622Z triton_mm_2686 0.0070 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4352865Z triton_mm_2683 0.0074 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4353099Z triton_mm_2682 0.0078 ms 84.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4353225Z SingleProcess AUTOTUNE benchmarking takes 0.0418 seconds and 0.0778 seconds precompiling for 6 choices 2025-12-04T09:54:17.4353266Z Autotune Choices Stats: 2025-12-04T09:54:17.4353633Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2688", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.00583899999037385, "best_triton_pos": 0} 2025-12-04T09:54:17.4353674Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4353711Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4353756Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4353993Z triton_mm_2688 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4354246Z triton_mm_2693 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4354480Z triton_mm_2692 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4354716Z triton_mm_2691 0.0062 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4354951Z triton_mm_2690 0.0063 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4355184Z triton_mm_2689 0.0137 ms 42.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4355309Z SingleProcess AUTOTUNE benchmarking takes 0.0588 seconds and 0.0846 seconds precompiling for 6 choices 2025-12-04T09:54:17.4355382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4355434Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4355491Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4355592Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4356130Z inductor [('triton_bundler_save_kernel', 136), ('benchmarking.InductorBenchmarker.benchmark_gpu', 19), ('benchmarking.InductorBenchmarker.benchmark', 11), ('async_compile_cache_miss', 10), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('pad_mm_bench', 1), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4356168Z graph_break [] 2025-12-04T09:54:17.4356213Z aten_mm_info [('aten.mm_8_2_5', 1)] 2025-12-04T09:54:17.4356286Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4356327Z Autotune Choices Stats: 2025-12-04T09:54:17.4356708Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2699", 
"best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005640000104904175, "best_triton_pos": 0} 2025-12-04T09:54:17.4356749Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4356786Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4356833Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4357072Z triton_mm_2699 0.0056 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4357306Z triton_mm_2697 0.0057 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4357542Z triton_mm_2698 0.0059 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4357789Z triton_mm_2695 0.0059 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4358024Z triton_mm_2694 0.0060 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4358258Z triton_mm_2696 0.0060 ms 94.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, 
num_stages=2, num_warps=1 2025-12-04T09:54:17.4358387Z SingleProcess AUTOTUNE benchmarking takes 0.1680 seconds and 0.0742 seconds precompiling for 6 choices 2025-12-04T09:54:17.4358426Z Autotune Choices Stats: 2025-12-04T09:54:17.4358794Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2705", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T09:54:17.4358832Z AUTOTUNE mm(8x2, 2x8) 2025-12-04T09:54:17.4358870Z strides: [2, 1], [8, 1] 2025-12-04T09:54:17.4358917Z dtypes: torch.bfloat16, torch.bfloat16 2025-12-04T09:54:17.4359155Z triton_mm_2705 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4359401Z triton_mm_2703 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4359643Z triton_mm_2702 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4359879Z triton_mm_2704 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4360114Z triton_mm_2700 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4360357Z triton_mm_2701 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4360485Z SingleProcess AUTOTUNE benchmarking takes 0.0415 seconds and 0.0952 seconds precompiling for 6 choices 2025-12-04T09:54:17.4360526Z Autotune Choices Stats: 2025-12-04T09:54:17.4360894Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_mm_2708", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T09:54:17.4360932Z AUTOTUNE mm(8x5, 5x2) 2025-12-04T09:54:17.4360970Z strides: [5, 1], [2, 1] 2025-12-04T09:54:17.4361015Z dtypes: torch.float16, torch.float16 2025-12-04T09:54:17.4361253Z triton_mm_2708 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4361504Z triton_mm_2711 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4361737Z triton_mm_2709 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4361974Z triton_mm_2710 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, 
EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4362209Z triton_mm_2706 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4362445Z triton_mm_2707 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4362571Z SingleProcess AUTOTUNE benchmarking takes 0.0414 seconds and 0.0719 seconds precompiling for 6 choices 2025-12-04T09:54:17.4362644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:54:17.4362685Z frames [('total', 1), ('ok', 1)] 2025-12-04T09:54:17.4362743Z stats [('calls_captured', 4), ('unique_graphs', 1)] 2025-12-04T09:54:17.4362854Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T09:54:17.4363380Z inductor [('triton_bundler_save_kernel', 176), ('benchmarking.InductorBenchmarker.benchmark_gpu', 20), ('async_compile_cache_miss', 14), ('benchmarking.InductorBenchmarker.benchmark', 14), ('select_algorithm_num_precompiles', 6), ('generated_module_cache_miss', 5), ('fxgraph_cache_miss', 1), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T09:54:17.4363419Z graph_break [] 2025-12-04T09:54:17.4363463Z aten_mm_info [('aten.mm_8_8_8', 1)] 2025-12-04T09:54:17.4363535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:17.4363576Z Autotune Choices Stats: 2025-12-04T09:54:17.4363958Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": 
"triton_mm_2716", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1", "best_time": 0.0060800001956522465, "best_triton_pos": 0} 2025-12-04T09:54:17.4363998Z AUTOTUNE mm(8x8, 8x8) 2025-12-04T09:54:17.4364035Z strides: [8, 1], [8, 1] 2025-12-04T09:54:17.4364081Z dtypes: torch.float32, torch.float32 2025-12-04T09:54:17.4364323Z triton_mm_2716 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4364559Z triton_mm_2717 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4364794Z triton_mm_2714 0.0122 ms 50.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4365026Z triton_mm_2715 0.0123 ms 49.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4365274Z triton_mm_2712 0.0128 ms 47.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T09:54:17.4365505Z triton_mm_2713 0.0160 ms 38.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, EVEN_K=False, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=1 2025-12-04T09:54:17.4365631Z SingleProcess AUTOTUNE benchmarking takes 0.0626 seconds and 0.2064 seconds precompiling for 6 choices 2025-12-04T09:54:17.4365866Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_pattern_matcher/inductor.test_pattern_matcher-9f787b25300815d0.xml - 2025-12-04T09:54:17.4365958Z =========================== short test summary info ============================ 2025-12-04T09:54:17.4366152Z FAILED [0.5672s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:54:17.4366190Z Searched string: 2025-12-04T09:54:17.4366250Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:54:17.4366253Z 2025-12-04T09:54:17.4366306Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:54:17.4366308Z 2025-12-04T09:54:17.4366369Z a_mask = offs_k[None, :] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.4366425Z b_mask = offs_k[:, None] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.4366427Z 2025-12-04T09:54:17.4366499Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.4366556Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.4366558Z 2025-12-04T09:54:17.4366603Z idx_m = offs_a_m[:, None] 2025-12-04T09:54:17.4366644Z idx_n = a_k_idx_vals 2025-12-04T09:54:17.4366685Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4366758Z a = tl.load(A + (xindex), mask=a_mask, other=0.0) 2025-12-04T09:54:17.4366760Z 2025-12-04T09:54:17.4366800Z idx_m = b_k_idx_vals 2025-12-04T09:54:17.4366843Z idx_n = offs_b_n[None, :] 2025-12-04T09:54:17.4366885Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4366939Z b = tl.load(B + (xindex), mask=b_mask, other=0.0) 2025-12-04T09:54:17.4366941Z 2025-12-04T09:54:17.4366942Z 2025-12-04T09:54:17.4367013Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:54:17.4367015Z 2025-12-04T09:54:17.4367017Z 
2025-12-04T09:54:17.4367070Z # rematerialize rm and rn to save registers 2025-12-04T09:54:17.4367127Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:54:17.4367190Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:54:17.4367230Z idx_m = rm[:, None] 2025-12-04T09:54:17.4367269Z idx_n = rn[None, :] 2025-12-04T09:54:17.4367312Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:54:17.4367316Z 2025-12-04T09:54:17.4367360Z # inductor generates a suffix 2025-12-04T09:54:17.4367400Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4367488Z tl.store(out_ptr0 + (tl.broadcast_to(xindex, [BLOCK_M, BLOCK_N])), acc, mask) 2025-12-04T09:54:17.4367528Z ''', device_str='cuda') 2025-12-04T09:54:17.4367530Z 2025-12-04T09:54:17.4367532Z 2025-12-04T09:54:17.4367576Z async_compile.wait(globals()) 2025-12-04T09:54:17.4367615Z del async_compile 2025-12-04T09:54:17.4367617Z 2025-12-04T09:54:17.4367652Z class Runner: 2025-12-04T09:54:17.4367698Z def __init__(self, partitions): 2025-12-04T09:54:17.4367743Z self.partitions = partitions 2025-12-04T09:54:17.4367747Z 2025-12-04T09:54:17.4367796Z def recursively_apply_fns(self, fns): 2025-12-04T09:54:17.4367840Z new_callables = [] 2025-12-04T09:54:17.4367893Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:54:17.4367939Z new_callables.append(fn(c)) 2025-12-04T09:54:17.4368004Z self.partitions = new_callables 2025-12-04T09:54:17.4368007Z 2025-12-04T09:54:17.4368049Z def call(self, args): 2025-12-04T09:54:17.4368089Z arg0_1, arg1_1 = args 2025-12-04T09:54:17.4368126Z args.clear() 2025-12-04T09:54:17.4368177Z assert_size_stride(arg0_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.4368227Z assert_size_stride(arg1_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.4368274Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:54:17.4368317Z torch.cuda.set_device(0) 2025-12-04T09:54:17.4368385Z buf0 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.4368475Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 
2025-12-04T09:54:17.4368520Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4368596Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 64, stream=stream0) 2025-12-04T09:54:17.4368633Z del arg0_1 2025-12-04T09:54:17.4368696Z buf1 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.4368802Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:54:17.4368847Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4368936Z triton_tem_fused__to_copy_mm_1.run(arg1_1, buf0, buf1, 1, 1, 1, stream=stream0) 2025-12-04T09:54:17.4368975Z del arg1_1 2025-12-04T09:54:17.4369012Z del buf0 2025-12-04T09:54:17.4369051Z return (buf1, ) 2025-12-04T09:54:17.4369054Z 2025-12-04T09:54:17.4369098Z runner = Runner(partitions=[]) 2025-12-04T09:54:17.4369136Z call = runner.call 2025-12-04T09:54:17.4369220Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:54:17.4369222Z 2025-12-04T09:54:17.4369223Z 2025-12-04T09:54:17.4369285Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:54:17.4369342Z from torch._dynamo.testing import rand_strided 2025-12-04T09:54:17.4369408Z from torch._inductor.utils import print_performance 2025-12-04T09:54:17.4369497Z arg0_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:54:17.4369574Z arg1_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.4369619Z fn = lambda: call([arg0_1, arg1_1]) 2025-12-04T09:54:17.4369688Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:54:17.4369691Z 2025-12-04T09:54:17.4369692Z 2025-12-04T09:54:17.4369732Z if __name__ == "__main__": 2025-12-04T09:54:17.4369815Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:54:17.4369882Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:54:17.4369921Z From CHECK: .to( 2025-12-04T09:54:17.4369923Z 2025-12-04T09:54:17.4369925Z 2025-12-04T09:54:17.4370007Z To execute this 
test, run the following from the base repo dir: 2025-12-04T09:54:17.4370141Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm 2025-12-04T09:54:17.4370146Z 2025-12-04T09:54:17.4370237Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:17.4370436Z FAILED [1.4161s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works - RuntimeError: Expected to find "tl.dot" but did not find it 2025-12-04T09:54:17.4370472Z Searched string: 2025-12-04T09:54:17.4370518Z # inductor generates a suffix 2025-12-04T09:54:17.4370559Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4370689Z tmp0 = tl.load(in_ptr2 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last').to(tl.float32) 2025-12-04T09:54:17.4370813Z tmp2 = tl.load(in_ptr3 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last').to(tl.float32) 2025-12-04T09:54:17.4370855Z tmp1 = acc * tmp0 2025-12-04T09:54:17.4370894Z tmp3 = tmp1 + tmp2 2025-12-04T09:54:17.4370991Z tl.store(out_ptr1 + (tl.broadcast_to(idx_n + 8*idx_m, [BLOCK_M, BLOCK_N])), tmp3, mask) 2025-12-04T09:54:17.4371044Z ''', device_str='cuda') 2025-12-04T09:54:17.4371047Z 2025-12-04T09:54:17.4371050Z 2025-12-04T09:54:17.4371093Z async_compile.wait(globals()) 2025-12-04T09:54:17.4371129Z del async_compile 2025-12-04T09:54:17.4371133Z 2025-12-04T09:54:17.4371168Z class Runner: 2025-12-04T09:54:17.4371213Z def __init__(self, partitions): 2025-12-04T09:54:17.4371260Z self.partitions = partitions 2025-12-04T09:54:17.4371262Z 2025-12-04T09:54:17.4371311Z def recursively_apply_fns(self, fns): 2025-12-04T09:54:17.4371351Z new_callables = [] 2025-12-04T09:54:17.4371402Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:54:17.4371449Z new_callables.append(fn(c)) 2025-12-04T09:54:17.4371493Z self.partitions = new_callables 2025-12-04T09:54:17.4371497Z 2025-12-04T09:54:17.4371536Z def call(self, args): 
2025-12-04T09:54:17.4371585Z arg0_1, arg1_1, arg2_1, arg3_1 = args 2025-12-04T09:54:17.4371621Z args.clear() 2025-12-04T09:54:17.4371673Z assert_size_stride(arg0_1, (2, 8), (8, 1)) 2025-12-04T09:54:17.4371722Z assert_size_stride(arg1_1, (8, 2), (2, 1)) 2025-12-04T09:54:17.4371769Z assert_size_stride(arg2_1, (8, ), (1, )) 2025-12-04T09:54:17.4371817Z assert_size_stride(arg3_1, (8, ), (1, )) 2025-12-04T09:54:17.4371863Z with torch.cuda._DeviceGuard(0): 2025-12-04T09:54:17.4371908Z torch.cuda.set_device(0) 2025-12-04T09:54:17.4371976Z buf0 = empty_strided_cuda((2, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.4372065Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:54:17.4372118Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4372194Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 16, stream=stream0) 2025-12-04T09:54:17.4372230Z del arg0_1 2025-12-04T09:54:17.4372296Z buf1 = empty_strided_cuda((8, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.4372387Z # Topologically Sorted Source Nodes: [mm], Original ATen: [aten.mm] 2025-12-04T09:54:17.4372432Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4372502Z triton_poi_fused_mm_1.run(arg1_1, buf1, 64, stream=stream0) 2025-12-04T09:54:17.4372539Z del arg1_1 2025-12-04T09:54:17.4372602Z buf2 = empty_strided_cuda((8, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.4372702Z # Topologically Sorted Source Nodes: [to, mm], Original ATen: [aten._to_copy, aten.mm] 2025-12-04T09:54:17.4372745Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4372819Z triton_poi_fused__to_copy_mm_2.run(buf0, buf2, 64, stream=stream0) 2025-12-04T09:54:17.4372856Z del buf0 2025-12-04T09:54:17.4372934Z buf4 = empty_strided_cuda((8, 8), (8, 1), torch.bfloat16) 2025-12-04T09:54:17.4373068Z # Topologically Sorted Source Nodes: [mm, to, mul, add], Original ATen: [aten.mm, aten._to_copy, aten.mul, aten.add] 2025-12-04T09:54:17.4373113Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4373226Z 
triton_tem_fused__to_copy_add_mm_mul_3.run(buf1, buf2, arg2_1, arg3_1, buf4, 1, 1, 1, stream=stream0) 2025-12-04T09:54:17.4373263Z del arg2_1 2025-12-04T09:54:17.4373299Z del arg3_1 2025-12-04T09:54:17.4373334Z del buf1 2025-12-04T09:54:17.4373371Z del buf2 2025-12-04T09:54:17.4373408Z return (buf4, ) 2025-12-04T09:54:17.4373410Z 2025-12-04T09:54:17.4373456Z runner = Runner(partitions=[]) 2025-12-04T09:54:17.4373493Z call = runner.call 2025-12-04T09:54:17.4373561Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:54:17.4373564Z 2025-12-04T09:54:17.4373566Z 2025-12-04T09:54:17.4373626Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:54:17.4373682Z from torch._dynamo.testing import rand_strided 2025-12-04T09:54:17.4373745Z from torch._inductor.utils import print_performance 2025-12-04T09:54:17.4373822Z arg0_1 = rand_strided((2, 8), (8, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:54:17.4373914Z arg1_1 = rand_strided((8, 2), (2, 1), device='cuda:0', dtype=torch.bfloat16) 2025-12-04T09:54:17.4373989Z arg2_1 = rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.bfloat16) 2025-12-04T09:54:17.4374061Z arg3_1 = rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.bfloat16) 2025-12-04T09:54:17.4374117Z fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1]) 2025-12-04T09:54:17.4374186Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:54:17.4374189Z 2025-12-04T09:54:17.4374190Z 2025-12-04T09:54:17.4374233Z if __name__ == "__main__": 2025-12-04T09:54:17.4374316Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:54:17.4374390Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:54:17.4374429Z From CHECK: tl.dot 2025-12-04T09:54:17.4374431Z 2025-12-04T09:54:17.4374433Z 2025-12-04T09:54:17.4374508Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:17.4374658Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_pattern_matcher.py 
TestPatternMatcher.test_mixed_mm_epi_works 2025-12-04T09:54:17.4374660Z 2025-12-04T09:54:17.4374751Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:17.4374945Z FAILED [0.9712s] inductor/test_pattern_matcher.py::TestPatternMatcher::test_mixed_mm_epi_works - RuntimeError: Expected to find ".to(" but did not find it 2025-12-04T09:54:17.4374984Z Searched string: 2025-12-04T09:54:17.4375042Z acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=ACC_TYPE) 2025-12-04T09:54:17.4375046Z 2025-12-04T09:54:17.4375106Z for k_idx in range(0, tl.cdiv(K, BLOCK_K)): 2025-12-04T09:54:17.4375108Z 2025-12-04T09:54:17.4375167Z a_mask = offs_k[None, :] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.4375221Z b_mask = offs_k[:, None] < (K - k_idx * BLOCK_K) 2025-12-04T09:54:17.4375223Z 2025-12-04T09:54:17.4375279Z a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.4375344Z b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) 2025-12-04T09:54:17.4375347Z 2025-12-04T09:54:17.4375391Z idx_m = offs_a_m[:, None] 2025-12-04T09:54:17.4375431Z idx_n = a_k_idx_vals 2025-12-04T09:54:17.4375473Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4375529Z a = tl.load(A + (xindex), mask=a_mask, other=0.0) 2025-12-04T09:54:17.4375532Z 2025-12-04T09:54:17.4375572Z idx_m = b_k_idx_vals 2025-12-04T09:54:17.4375615Z idx_n = offs_b_n[None, :] 2025-12-04T09:54:17.4375657Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4375710Z b = tl.load(B + (xindex), mask=b_mask, other=0.0) 2025-12-04T09:54:17.4375713Z 2025-12-04T09:54:17.4375715Z 2025-12-04T09:54:17.4375794Z acc += tl.dot(a, b, allow_tf32=ALLOW_TF32, out_dtype=ACC_TYPE) 2025-12-04T09:54:17.4375796Z 2025-12-04T09:54:17.4375798Z 2025-12-04T09:54:17.4375849Z # rematerialize rm and rn to save registers 2025-12-04T09:54:17.4375905Z rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) 2025-12-04T09:54:17.4375988Z rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) 2025-12-04T09:54:17.4376029Z idx_m = rm[:, None] 
2025-12-04T09:54:17.4376067Z idx_n = rn[None, :] 2025-12-04T09:54:17.4376110Z mask = (idx_m < M) & (idx_n < N) 2025-12-04T09:54:17.4376112Z 2025-12-04T09:54:17.4376154Z # inductor generates a suffix 2025-12-04T09:54:17.4376195Z xindex = idx_n + 8*idx_m 2025-12-04T09:54:17.4376307Z tmp0 = tl.load(in_ptr2 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last') 2025-12-04T09:54:17.4376419Z tmp2 = tl.load(in_ptr3 + (tl.broadcast_to(idx_n, [BLOCK_M, BLOCK_N])), mask, eviction_policy='evict_last') 2025-12-04T09:54:17.4376458Z tmp1 = acc * tmp0 2025-12-04T09:54:17.4376499Z tmp3 = tmp1 + tmp2 2025-12-04T09:54:17.4376593Z tl.store(out_ptr1 + (tl.broadcast_to(idx_n + 8*idx_m, [BLOCK_M, BLOCK_N])), tmp3, mask) 2025-12-04T09:54:17.4376632Z ''', device_str='cuda') 2025-12-04T09:54:17.4376652Z 2025-12-04T09:54:17.4376654Z 2025-12-04T09:54:17.4376698Z async_compile.wait(globals()) 2025-12-04T09:54:17.4376735Z del async_compile 2025-12-04T09:54:17.4376737Z 2025-12-04T09:54:17.4376772Z class Runner: 2025-12-04T09:54:17.4376816Z def __init__(self, partitions): 2025-12-04T09:54:17.4376860Z self.partitions = partitions 2025-12-04T09:54:17.4376862Z 2025-12-04T09:54:17.4376915Z def recursively_apply_fns(self, fns): 2025-12-04T09:54:17.4376954Z new_callables = [] 2025-12-04T09:54:17.4377007Z for fn, c in zip(fns, self.partitions): 2025-12-04T09:54:17.4377052Z new_callables.append(fn(c)) 2025-12-04T09:54:17.4377098Z self.partitions = new_callables 2025-12-04T09:54:17.4377100Z 2025-12-04T09:54:17.4377141Z def call(self, args): 2025-12-04T09:54:17.4377187Z arg0_1, arg1_1, arg2_1, arg3_1 = args 2025-12-04T09:54:17.4377224Z args.clear() 2025-12-04T09:54:17.4377275Z assert_size_stride(arg0_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.4377325Z assert_size_stride(arg1_1, (8, 8), (8, 1)) 2025-12-04T09:54:17.4377373Z assert_size_stride(arg2_1, (8, ), (1, )) 2025-12-04T09:54:17.4377419Z assert_size_stride(arg3_1, (8, ), (1, )) 2025-12-04T09:54:17.4377469Z with 
torch.cuda._DeviceGuard(0): 2025-12-04T09:54:17.4377511Z torch.cuda.set_device(0) 2025-12-04T09:54:17.4377576Z buf0 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.4377664Z # Topologically Sorted Source Nodes: [to], Original ATen: [aten._to_copy] 2025-12-04T09:54:17.4377708Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4377797Z triton_poi_fused__to_copy_0.run(arg0_1, buf0, 64, stream=stream0) 2025-12-04T09:54:17.4377837Z del arg0_1 2025-12-04T09:54:17.4377900Z buf2 = empty_strided_cuda((8, 8), (8, 1), torch.float32) 2025-12-04T09:54:17.4378032Z # Topologically Sorted Source Nodes: [to, mm, mul, add], Original ATen: [aten._to_copy, aten.mm, aten.mul, aten.add] 2025-12-04T09:54:17.4378087Z stream0 = get_raw_stream(0) 2025-12-04T09:54:17.4378205Z triton_tem_fused__to_copy_add_mm_mul_1.run(arg1_1, buf0, arg2_1, arg3_1, buf2, 1, 1, 1, stream=stream0) 2025-12-04T09:54:17.4378241Z del arg1_1 2025-12-04T09:54:17.4378278Z del arg2_1 2025-12-04T09:54:17.4378313Z del arg3_1 2025-12-04T09:54:17.4378350Z del buf0 2025-12-04T09:54:17.4378388Z return (buf2, ) 2025-12-04T09:54:17.4378390Z 2025-12-04T09:54:17.4378436Z runner = Runner(partitions=[]) 2025-12-04T09:54:17.4378473Z call = runner.call 2025-12-04T09:54:17.4378541Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T09:54:17.4378543Z 2025-12-04T09:54:17.4378560Z 2025-12-04T09:54:17.4378620Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T09:54:17.4378675Z from torch._dynamo.testing import rand_strided 2025-12-04T09:54:17.4378738Z from torch._inductor.utils import print_performance 2025-12-04T09:54:17.4378816Z arg0_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.int8) 2025-12-04T09:54:17.4378893Z arg1_1 = rand_strided((8, 8), (8, 1), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.4378966Z arg2_1 = rand_strided((8, ), (1, ), device='cuda:0', dtype=torch.float32) 2025-12-04T09:54:17.4379039Z arg3_1 = rand_strided((8, ), (1, ), device='cuda:0', 
dtype=torch.float32) 2025-12-04T09:54:17.4379096Z fn = lambda: call([arg0_1, arg1_1, arg2_1, arg3_1]) 2025-12-04T09:54:17.4379165Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T09:54:17.4379167Z 2025-12-04T09:54:17.4379170Z 2025-12-04T09:54:17.4379211Z if __name__ == "__main__": 2025-12-04T09:54:17.4379294Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T09:54:17.4379360Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T09:54:17.4379399Z From CHECK: .to( 2025-12-04T09:54:17.4379401Z 2025-12-04T09:54:17.4379416Z 2025-12-04T09:54:17.4379490Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:17.4379635Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_pattern_matcher.py TestPatternMatcher.test_mixed_mm_epi_works 2025-12-04T09:54:17.4379637Z 2025-12-04T09:54:17.4379723Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:17.4379801Z ============ 3 failed, 197 passed, 50 skipped in 567.62s (0:09:27) ============= 2025-12-04T09:54:17.4379803Z 2025-12-04T09:54:17.4379983Z FINISHED PRINTING LOG FILE of inductor/test_pattern_matcher 1/1 (test/test-reports/inductor.test_pattern_matcher_1.1_8672400c1baf9dfa_.log) 2025-12-04T09:54:17.4379986Z 2025-12-04T09:54:17.4380108Z Finished inductor/test_pattern_matcher 1/1 ... [2025-12-04 09:54:17.268939][5636077.775339492], took 9.65min 2025-12-04T09:54:17.4380346Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:54:17.4380463Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:54:17.4380558Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T09:54:17.4380606Z Uploading artifacts took 0.00 seconds 2025-12-04T09:54:17.4380656Z inductor/test_pattern_matcher 1/1 failed! 
2025-12-04T09:54:17.4380748Z Running inductor/test_cpu_repro 1/5 ... [2025-12-04 09:54:17.376699][5636077.883097348] 2025-12-04T09:54:17.4380796Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:54:17.4381161Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpu_repro.py', '--shard-id=1', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:54:17.377013] 2025-12-04T09:54:24.4006418Z 2025-12-04T09:54:24.4007552Z inductor/test_cpu_repro 1/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpu_repro_1.5_6e55d20879e02348_.log 2025-12-04T09:54:24.4007908Z Running 0 items in this shard: 2025-12-04T09:54:24.4007988Z 2025-12-04T09:54:24.4008107Z Finished inductor/test_cpu_repro 1/5 ... [2025-12-04 09:54:24.400074][5636084.906474931], took 0.12min 2025-12-04T09:54:24.4017292Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:54:24.4918334Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:54:24.4935342Z Running inductor/test_compiled_autograd 1/2 ... [2025-12-04 09:54:24.493165][5636084.99956306] 2025-12-04T09:54:24.4936664Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:54:24.4938224Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_compiled_autograd.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:54:24.493478] 2025-12-04T09:54:40.8882139Z 2025-12-04T09:54:40.8883265Z PRINTING LOG FILE of inductor/test_compiled_autograd 1/2 (test/test-reports/inductor.test_compiled_autograd_1.2_0fbf59f36039e870_.log) 2025-12-04T09:54:40.8884691Z Test results will be stored in test-reports/python-pytest/inductor.test_compiled_autograd/inductor.test_compiled_autograd-420ab70ad85293fc.xml 2025-12-04T09:54:40.8885630Z ============================= test session starts ============================== 2025-12-04T09:54:40.8886526Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:54:40.8887169Z cachedir: .pytest_cache 2025-12-04T09:54:40.8887926Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:54:40.8889456Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T09:54:40.8889860Z configfile: pytest.ini 2025-12-04T09:54:40.8890611Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:54:40.8891419Z collecting ... 
collected 861 items 2025-12-04T09:54:40.8891901Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:54:40.8922827Z Running 50 items in this shard: test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, 
test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, 
test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, 
test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel, test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.8953281Z 2025-12-04T09:54:40.8954485Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py W1204 09:54:33.405000 176019 site-packages/torch/_inductor/utils.py:2565] [!0/0/0] DeviceCopy in input program 2025-12-04T09:54:40.8956192Z W1204 09:54:35.705000 176019 site-packages/torch/_inductor/utils.py:2565] [!1/6/0] DeviceCopy in input program 2025-12-04T09:54:40.8956877Z PASSED [7.1532s] [ 2%] 2025-12-04T09:54:40.8957840Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0016s] [ 2%] 2025-12-04T09:54:40.8959452Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0008s] [ 2%] 2025-12-04T09:54:40.8961039Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0008s] [ 2%] 2025-12-04T09:54:40.8962629Z 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0007s] [ 2%] 2025-12-04T09:54:40.8964208Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8965834Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0005s] [ 2%] 2025-12-04T09:54:40.8967453Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8969026Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8970617Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8972196Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8973778Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8975370Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0005s] [ 2%] 2025-12-04T09:54:40.8976972Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8978551Z 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8980173Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8981785Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8983380Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8984955Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0005s] [ 2%] 2025-12-04T09:54:40.8987247Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8988752Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8989307Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8989864Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8990397Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8990926Z 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8991465Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8991997Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8992615Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8993144Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8993673Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8994207Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8994740Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8995273Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8995807Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8996354Z 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8996890Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8997364Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8997914Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.8998387Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8998862Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8999361Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.8999847Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.9000326Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.9000800Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.9001271Z 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.9001747Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.9002224Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.9002730Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.9003206Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0004s] [ 2%] 2025-12-04T09:54:40.9003685Z inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel <- test/test_autograd.py FAILED [0.0003s] [ 2%] 2025-12-04T09:54:40.9003943Z 2025-12-04T09:54:40.9004010Z =================================== FAILURES =================================== 2025-12-04T09:54:40.9004233Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9004440Z Traceback (most recent call last): 2025-12-04T09:54:40.9004698Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9004942Z method(*args, **kwargs) 2025-12-04T09:54:40.9005139Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9005342Z stack.enter_context(ctx) 2025-12-04T09:54:40.9005516Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 
2025-12-04T09:54:40.9005700Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9005874Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9006091Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9006242Z AttributeError: args 2025-12-04T09:54:40.9006307Z 2025-12-04T09:54:40.9006386Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9006682Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9006903Z 2025-12-04T09:54:40.9007012Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9007220Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9007705Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T09:54:40.9008182Z warnings.warn( 2025-12-04T09:54:40.9008989Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9009787Z b_grad = a.grad 2025-12-04T09:54:40.9010545Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. 
Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9011331Z c_grad = a.grad 2025-12-04T09:54:40.9011505Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9011706Z Traceback (most recent call last): 2025-12-04T09:54:40.9011961Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9012198Z method(*args, **kwargs) 2025-12-04T09:54:40.9012388Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9012589Z stack.enter_context(ctx) 2025-12-04T09:54:40.9012760Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9012940Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9013115Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9013294Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9013417Z AttributeError: args 2025-12-04T09:54:40.9013482Z 2025-12-04T09:54:40.9013556Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9013848Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9014070Z 2025-12-04T09:54:40.9014157Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9014357Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:54:40.9014827Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 2025-12-04T09:54:40.9015259Z warnings.warn( 2025-12-04T09:54:40.9016114Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9016889Z b_grad = a.grad 2025-12-04T09:54:40.9017672Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9018443Z c_grad = a.grad 2025-12-04T09:54:40.9018613Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9018818Z Traceback (most recent call last): 2025-12-04T09:54:40.9019052Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9019286Z method(*args, **kwargs) 2025-12-04T09:54:40.9019474Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9019672Z stack.enter_context(ctx) 2025-12-04T09:54:40.9019844Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9020025Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9020202Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9020416Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9031034Z AttributeError: args 2025-12-04T09:54:40.9031117Z 2025-12-04T09:54:40.9031201Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9031585Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9031803Z 2025-12-04T09:54:40.9031899Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9032109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9032594Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9033033Z warnings.warn( 2025-12-04T09:54:40.9033820Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9034605Z b_grad = a.grad 2025-12-04T09:54:40.9035372Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9036213Z c_grad = a.grad 2025-12-04T09:54:40.9036425Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9036635Z Traceback (most recent call last): 2025-12-04T09:54:40.9036877Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9037118Z method(*args, **kwargs) 2025-12-04T09:54:40.9037314Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9037519Z stack.enter_context(ctx) 2025-12-04T09:54:40.9037695Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9037900Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9038080Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9038265Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9038395Z AttributeError: args 2025-12-04T09:54:40.9038463Z 2025-12-04T09:54:40.9038540Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9038840Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9039067Z 2025-12-04T09:54:40.9039154Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9039361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9039841Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9040278Z warnings.warn( 2025-12-04T09:54:40.9041059Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9041870Z b_grad = a.grad 2025-12-04T09:54:40.9042633Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9043410Z c_grad = a.grad 2025-12-04T09:54:40.9043588Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9043792Z Traceback (most recent call last): 2025-12-04T09:54:40.9044029Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9044268Z method(*args, **kwargs) 2025-12-04T09:54:40.9044464Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9044684Z stack.enter_context(ctx) 2025-12-04T09:54:40.9044862Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9045051Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9045229Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9045425Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9045554Z AttributeError: args 2025-12-04T09:54:40.9045618Z 2025-12-04T09:54:40.9045699Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9046039Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9046259Z 2025-12-04T09:54:40.9046350Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9046554Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9047054Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9124861Z warnings.warn( 2025-12-04T09:54:40.9125631Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9126459Z b_grad = a.grad 2025-12-04T09:54:40.9127240Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9128008Z c_grad = a.grad 2025-12-04T09:54:40.9128173Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9128369Z Traceback (most recent call last): 2025-12-04T09:54:40.9128614Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9128843Z method(*args, **kwargs) 2025-12-04T09:54:40.9129026Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9129221Z stack.enter_context(ctx) 2025-12-04T09:54:40.9129389Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9129564Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9129732Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9129906Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9130021Z AttributeError: args 2025-12-04T09:54:40.9130082Z 2025-12-04T09:54:40.9130155Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9130445Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9130660Z 2025-12-04T09:54:40.9130746Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9130941Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9131424Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9131848Z warnings.warn( 2025-12-04T09:54:40.9132614Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9133385Z b_grad = a.grad 2025-12-04T09:54:40.9134139Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9134912Z c_grad = a.grad 2025-12-04T09:54:40.9135100Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9135296Z Traceback (most recent call last): 2025-12-04T09:54:40.9135526Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9135756Z method(*args, **kwargs) 2025-12-04T09:54:40.9135998Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9136194Z stack.enter_context(ctx) 2025-12-04T09:54:40.9136363Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9136547Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9136723Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9136906Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9137031Z AttributeError: args 2025-12-04T09:54:40.9137101Z 2025-12-04T09:54:40.9137176Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9137499Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9137721Z 2025-12-04T09:54:40.9137808Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9138014Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9138487Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9138923Z warnings.warn( 2025-12-04T09:54:40.9139699Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9140509Z b_grad = a.grad 2025-12-04T09:54:40.9141276Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9142056Z c_grad = a.grad 2025-12-04T09:54:40.9142232Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9142437Z Traceback (most recent call last): 2025-12-04T09:54:40.9142676Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9142913Z method(*args, **kwargs) 2025-12-04T09:54:40.9143105Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9143307Z stack.enter_context(ctx) 2025-12-04T09:54:40.9143481Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9143665Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9143842Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9144049Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9144176Z AttributeError: args 2025-12-04T09:54:40.9144241Z 2025-12-04T09:54:40.9144321Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9144617Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9144850Z 2025-12-04T09:54:40.9144942Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9145145Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9145621Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9146151Z warnings.warn( 2025-12-04T09:54:40.9146947Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9147730Z b_grad = a.grad 2025-12-04T09:54:40.9148495Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9149271Z c_grad = a.grad 2025-12-04T09:54:40.9149444Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9149668Z Traceback (most recent call last): 2025-12-04T09:54:40.9149906Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9150143Z method(*args, **kwargs) 2025-12-04T09:54:40.9150334Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9150539Z stack.enter_context(ctx) 2025-12-04T09:54:40.9150714Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9150901Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9151080Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9151264Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9151390Z AttributeError: args 2025-12-04T09:54:40.9151458Z 2025-12-04T09:54:40.9151533Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9151829Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9152052Z 2025-12-04T09:54:40.9152139Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9152342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9152818Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9153268Z warnings.warn( 2025-12-04T09:54:40.9154058Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9154838Z b_grad = a.grad 2025-12-04T09:54:40.9155619Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9156455Z c_grad = a.grad 2025-12-04T09:54:40.9156633Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9156837Z Traceback (most recent call last): 2025-12-04T09:54:40.9157080Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9157318Z method(*args, **kwargs) 2025-12-04T09:54:40.9157510Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9157713Z stack.enter_context(ctx) 2025-12-04T09:54:40.9157891Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9158080Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9158257Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9158438Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9158563Z AttributeError: args 2025-12-04T09:54:40.9158657Z 2025-12-04T09:54:40.9158734Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9159028Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9159251Z 2025-12-04T09:54:40.9159341Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9159543Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9160023Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9160459Z warnings.warn( 2025-12-04T09:54:40.9161232Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9162010Z b_grad = a.grad 2025-12-04T09:54:40.9162799Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9163593Z c_grad = a.grad 2025-12-04T09:54:40.9163767Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9163972Z Traceback (most recent call last): 2025-12-04T09:54:40.9164209Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9164446Z method(*args, **kwargs) 2025-12-04T09:54:40.9164636Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9164840Z stack.enter_context(ctx) 2025-12-04T09:54:40.9165030Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9165219Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9165397Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9165584Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9165712Z AttributeError: args 2025-12-04T09:54:40.9165776Z 2025-12-04T09:54:40.9165855Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9166329Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9166548Z 2025-12-04T09:54:40.9166641Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9166845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9167317Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9167753Z warnings.warn( 2025-12-04T09:54:40.9168531Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9169337Z b_grad = a.grad 2025-12-04T09:54:40.9170101Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9170889Z c_grad = a.grad 2025-12-04T09:54:40.9171057Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9171257Z Traceback (most recent call last): 2025-12-04T09:54:40.9171491Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9171729Z method(*args, **kwargs) 2025-12-04T09:54:40.9171934Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9172137Z stack.enter_context(ctx) 2025-12-04T09:54:40.9172308Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9172492Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9172690Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9172873Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9172996Z AttributeError: args 2025-12-04T09:54:40.9173063Z 2025-12-04T09:54:40.9173137Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9173429Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9173650Z 2025-12-04T09:54:40.9173736Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9173940Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9174432Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9174866Z warnings.warn( 2025-12-04T09:54:40.9175637Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9176466Z b_grad = a.grad 2025-12-04T09:54:40.9177234Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9178033Z c_grad = a.grad 2025-12-04T09:54:40.9178207Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9178405Z Traceback (most recent call last): 2025-12-04T09:54:40.9178642Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9178877Z method(*args, **kwargs) 2025-12-04T09:54:40.9179070Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9179272Z stack.enter_context(ctx) 2025-12-04T09:54:40.9179446Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9179630Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9179803Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9179982Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9180105Z AttributeError: args 2025-12-04T09:54:40.9180168Z 2025-12-04T09:54:40.9180246Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9180540Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9180754Z 2025-12-04T09:54:40.9180864Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9181068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9181558Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9181997Z warnings.warn( 2025-12-04T09:54:40.9182791Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9183567Z b_grad = a.grad 2025-12-04T09:54:40.9184332Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9185109Z c_grad = a.grad 2025-12-04T09:54:40.9185283Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9185485Z Traceback (most recent call last): 2025-12-04T09:54:40.9185726Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9185999Z method(*args, **kwargs) 2025-12-04T09:54:40.9186191Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9186418Z stack.enter_context(ctx) 2025-12-04T09:54:40.9186595Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9186780Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9186956Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9187136Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9187258Z AttributeError: args 2025-12-04T09:54:40.9187324Z 2025-12-04T09:54:40.9187397Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9187689Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9187914Z 2025-12-04T09:54:40.9188002Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9188206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9188683Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9189117Z warnings.warn( 2025-12-04T09:54:40.9189900Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9190696Z b_grad = a.grad 2025-12-04T09:54:40.9191474Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9192249Z c_grad = a.grad 2025-12-04T09:54:40.9192417Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9192619Z Traceback (most recent call last): 2025-12-04T09:54:40.9192874Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9193111Z method(*args, **kwargs) 2025-12-04T09:54:40.9193303Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9193507Z stack.enter_context(ctx) 2025-12-04T09:54:40.9193681Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9193871Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9194049Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9194234Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9194358Z AttributeError: args 2025-12-04T09:54:40.9194422Z 2025-12-04T09:54:40.9194499Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9194801Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9195020Z 2025-12-04T09:54:40.9195112Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9195338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9195813Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9196276Z warnings.warn( 2025-12-04T09:54:40.9197055Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9197840Z b_grad = a.grad 2025-12-04T09:54:40.9198606Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9199402Z c_grad = a.grad 2025-12-04T09:54:40.9199578Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9199782Z Traceback (most recent call last): 2025-12-04T09:54:40.9200024Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9200281Z method(*args, **kwargs) 2025-12-04T09:54:40.9200476Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9200679Z stack.enter_context(ctx) 2025-12-04T09:54:40.9200856Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9201046Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9201223Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9201403Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9201530Z AttributeError: args 2025-12-04T09:54:40.9201594Z 2025-12-04T09:54:40.9201698Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9201991Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9202209Z 2025-12-04T09:54:40.9202302Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9202508Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9202986Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9203421Z warnings.warn( 2025-12-04T09:54:40.9204194Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9204990Z b_grad = a.grad 2025-12-04T09:54:40.9205754Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9206575Z c_grad = a.grad 2025-12-04T09:54:40.9206749Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9206950Z Traceback (most recent call last): 2025-12-04T09:54:40.9207189Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9207425Z method(*args, **kwargs) 2025-12-04T09:54:40.9207616Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9207819Z stack.enter_context(ctx) 2025-12-04T09:54:40.9207991Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9208179Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9208356Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9208553Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9208679Z AttributeError: args 2025-12-04T09:54:40.9208747Z 2025-12-04T09:54:40.9208821Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9209132Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9209357Z 2025-12-04T09:54:40.9209445Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9209650Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9210126Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9210558Z warnings.warn( 2025-12-04T09:54:40.9211350Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9212132Z b_grad = a.grad 2025-12-04T09:54:40.9212902Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9213679Z c_grad = a.grad 2025-12-04T09:54:40.9213852Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9214090Z Traceback (most recent call last): 2025-12-04T09:54:40.9214327Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9214564Z method(*args, **kwargs) 2025-12-04T09:54:40.9214757Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9214959Z stack.enter_context(ctx) 2025-12-04T09:54:40.9215134Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9215319Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9215498Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9215682Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9215808Z AttributeError: args 2025-12-04T09:54:40.9215872Z 2025-12-04T09:54:40.9215990Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9216288Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9216505Z 2025-12-04T09:54:40.9216598Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9216803Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9217276Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9217724Z warnings.warn( 2025-12-04T09:54:40.9218524Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9219305Z b_grad = a.grad 2025-12-04T09:54:40.9220096Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9220872Z c_grad = a.grad 2025-12-04T09:54:40.9221048Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9221251Z Traceback (most recent call last): 2025-12-04T09:54:40.9221492Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9221730Z method(*args, **kwargs) 2025-12-04T09:54:40.9221922Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9222125Z stack.enter_context(ctx) 2025-12-04T09:54:40.9222298Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9222486Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9222666Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9222849Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9223005Z AttributeError: args 2025-12-04T09:54:40.9223075Z 2025-12-04T09:54:40.9223154Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9223450Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9223674Z 2025-12-04T09:54:40.9223762Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9223966Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9224442Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9224880Z warnings.warn( 2025-12-04T09:54:40.9225652Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9226480Z b_grad = a.grad 2025-12-04T09:54:40.9227271Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9228187Z c_grad = a.grad 2025-12-04T09:54:40.9228362Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9228567Z Traceback (most recent call last): 2025-12-04T09:54:40.9228806Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9229046Z method(*args, **kwargs) 2025-12-04T09:54:40.9229238Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9229442Z stack.enter_context(ctx) 2025-12-04T09:54:40.9229645Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9229832Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9230010Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9230199Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9230326Z AttributeError: args 2025-12-04T09:54:40.9230391Z 2025-12-04T09:54:40.9230469Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9230764Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9230982Z 2025-12-04T09:54:40.9231077Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9231281Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9231756Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9232189Z warnings.warn( 2025-12-04T09:54:40.9232993Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9233774Z b_grad = a.grad 2025-12-04T09:54:40.9234542Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9235320Z c_grad = a.grad 2025-12-04T09:54:40.9235494Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9235696Z Traceback (most recent call last): 2025-12-04T09:54:40.9235984Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9236222Z method(*args, **kwargs) 2025-12-04T09:54:40.9236438Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9236640Z stack.enter_context(ctx) 2025-12-04T09:54:40.9236816Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9237002Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9237206Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9237390Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9237516Z AttributeError: args 2025-12-04T09:54:40.9237584Z 2025-12-04T09:54:40.9237658Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9237952Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9238172Z 2025-12-04T09:54:40.9238260Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9238465Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9238957Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9239393Z warnings.warn( 2025-12-04T09:54:40.9240164Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9240944Z b_grad = a.grad 2025-12-04T09:54:40.9241720Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9242517Z c_grad = a.grad 2025-12-04T09:54:40.9242691Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9242896Z Traceback (most recent call last): 2025-12-04T09:54:40.9243137Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9243378Z method(*args, **kwargs) 2025-12-04T09:54:40.9243573Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9243776Z stack.enter_context(ctx) 2025-12-04T09:54:40.9243952Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9244139Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9244316Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9244499Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9244621Z AttributeError: args 2025-12-04T09:54:40.9244688Z 2025-12-04T09:54:40.9244763Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9245058Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9245291Z 2025-12-04T09:54:40.9245383Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9245583Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9246127Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9246564Z warnings.warn( 2025-12-04T09:54:40.9247356Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9248134Z b_grad = a.grad 2025-12-04T09:54:40.9248900Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9249681Z c_grad = a.grad 2025-12-04T09:54:40.9249856Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9250059Z Traceback (most recent call last): 2025-12-04T09:54:40.9250298Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9250532Z method(*args, **kwargs) 2025-12-04T09:54:40.9250724Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9250947Z stack.enter_context(ctx) 2025-12-04T09:54:40.9251124Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9251310Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9251481Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9251656Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9251775Z AttributeError: args 2025-12-04T09:54:40.9251837Z 2025-12-04T09:54:40.9251912Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9252202Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9252428Z 2025-12-04T09:54:40.9252518Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9252716Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9253186Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9253615Z warnings.warn( 2025-12-04T09:54:40.9254380Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9255176Z b_grad = a.grad 2025-12-04T09:54:40.9256000Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9256786Z c_grad = a.grad 2025-12-04T09:54:40.9256963Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9257181Z Traceback (most recent call last): 2025-12-04T09:54:40.9257419Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9257657Z method(*args, **kwargs) 2025-12-04T09:54:40.9257876Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9258361Z stack.enter_context(ctx) 2025-12-04T09:54:40.9258572Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9258787Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9259028Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9259241Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9259412Z AttributeError: args 2025-12-04T09:54:40.9259512Z 2025-12-04T09:54:40.9259598Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9259929Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9264002Z 2025-12-04T09:54:40.9264104Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9264355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9264831Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9265266Z warnings.warn( 2025-12-04T09:54:40.9266095Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9266874Z b_grad = a.grad 2025-12-04T09:54:40.9267636Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9268432Z c_grad = a.grad 2025-12-04T09:54:40.9268603Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9268804Z Traceback (most recent call last): 2025-12-04T09:54:40.9269054Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9269292Z method(*args, **kwargs) 2025-12-04T09:54:40.9269483Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9269681Z stack.enter_context(ctx) 2025-12-04T09:54:40.9269852Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9270033Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9270207Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9270383Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9270508Z AttributeError: args 2025-12-04T09:54:40.9270574Z 2025-12-04T09:54:40.9270664Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9270956Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9271179Z 2025-12-04T09:54:40.9271267Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9271469Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9271940Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9272372Z warnings.warn( 2025-12-04T09:54:40.9273142Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9273939Z b_grad = a.grad 2025-12-04T09:54:40.9274696Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9275469Z c_grad = a.grad 2025-12-04T09:54:40.9275639Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9275844Z Traceback (most recent call last): 2025-12-04T09:54:40.9276124Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9276357Z method(*args, **kwargs) 2025-12-04T09:54:40.9276544Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9276743Z stack.enter_context(ctx) 2025-12-04T09:54:40.9276917Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9277100Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9277271Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9277467Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9277590Z AttributeError: args 2025-12-04T09:54:40.9277653Z 2025-12-04T09:54:40.9277727Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9278032Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9278253Z 2025-12-04T09:54:40.9278342Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9278540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9279006Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9279436Z warnings.warn( 2025-12-04T09:54:40.9280215Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9280988Z b_grad = a.grad 2025-12-04T09:54:40.9281747Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9282523Z c_grad = a.grad 2025-12-04T09:54:40.9282707Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9282904Z Traceback (most recent call last): 2025-12-04T09:54:40.9283134Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9283366Z method(*args, **kwargs) 2025-12-04T09:54:40.9283551Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9283748Z stack.enter_context(ctx) 2025-12-04T09:54:40.9283914Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9284093Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9284263Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9284439Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9284556Z AttributeError: args 2025-12-04T09:54:40.9284618Z 2025-12-04T09:54:40.9284694Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9284983Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9285199Z 2025-12-04T09:54:40.9285284Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9285479Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9285978Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9286424Z warnings.warn( 2025-12-04T09:54:40.9287221Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9287995Z b_grad = a.grad 2025-12-04T09:54:40.9288778Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9289553Z c_grad = a.grad 2025-12-04T09:54:40.9289721Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9289917Z Traceback (most recent call last): 2025-12-04T09:54:40.9290148Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9290378Z method(*args, **kwargs) 2025-12-04T09:54:40.9290563Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9290759Z stack.enter_context(ctx) 2025-12-04T09:54:40.9290928Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9291109Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9291277Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9291452Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9291590Z AttributeError: args 2025-12-04T09:54:40.9291655Z 2025-12-04T09:54:40.9291729Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9292014Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9292230Z 2025-12-04T09:54:40.9292317Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9292512Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9292979Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9293401Z warnings.warn( 2025-12-04T09:54:40.9294168Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9294947Z b_grad = a.grad 2025-12-04T09:54:40.9295739Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9296566Z c_grad = a.grad 2025-12-04T09:54:40.9296733Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9296929Z Traceback (most recent call last): 2025-12-04T09:54:40.9297159Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9297388Z method(*args, **kwargs) 2025-12-04T09:54:40.9297572Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9297768Z stack.enter_context(ctx) 2025-12-04T09:54:40.9297951Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9298129Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9298298Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9298475Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9298591Z AttributeError: args 2025-12-04T09:54:40.9298654Z 2025-12-04T09:54:40.9298726Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9299015Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9299232Z 2025-12-04T09:54:40.9299317Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9299511Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9299980Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9300423Z warnings.warn( 2025-12-04T09:54:40.9301188Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9301962Z b_grad = a.grad 2025-12-04T09:54:40.9302724Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9303494Z c_grad = a.grad 2025-12-04T09:54:40.9303660Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9303856Z Traceback (most recent call last): 2025-12-04T09:54:40.9304087Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9304333Z method(*args, **kwargs) 2025-12-04T09:54:40.9304520Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9304715Z stack.enter_context(ctx) 2025-12-04T09:54:40.9304884Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9305080Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9305249Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9305423Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9305538Z AttributeError: args 2025-12-04T09:54:40.9305601Z 2025-12-04T09:54:40.9305673Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9306019Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9306233Z 2025-12-04T09:54:40.9306320Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9306534Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9307001Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9307424Z warnings.warn( 2025-12-04T09:54:40.9308191Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9308962Z b_grad = a.grad 2025-12-04T09:54:40.9309717Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9310509Z c_grad = a.grad 2025-12-04T09:54:40.9310674Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9310871Z Traceback (most recent call last): 2025-12-04T09:54:40.9311101Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9311332Z method(*args, **kwargs) 2025-12-04T09:54:40.9311517Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9311711Z stack.enter_context(ctx) 2025-12-04T09:54:40.9311884Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9312063Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9312231Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9312403Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9312519Z AttributeError: args 2025-12-04T09:54:40.9312585Z 2025-12-04T09:54:40.9312657Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9312945Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9313177Z 2025-12-04T09:54:40.9313264Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9313461Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9313944Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9314373Z warnings.warn( 2025-12-04T09:54:40.9315150Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9315969Z b_grad = a.grad 2025-12-04T09:54:40.9316724Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9317490Z c_grad = a.grad 2025-12-04T09:54:40.9317657Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9317855Z Traceback (most recent call last): 2025-12-04T09:54:40.9318085Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9318315Z method(*args, **kwargs) 2025-12-04T09:54:40.9318498Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9318726Z stack.enter_context(ctx) 2025-12-04T09:54:40.9318894Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9319071Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9319239Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9319413Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9319531Z AttributeError: args 2025-12-04T09:54:40.9319591Z 2025-12-04T09:54:40.9319664Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9319953Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9320170Z 2025-12-04T09:54:40.9320256Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9320452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9320922Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9321344Z warnings.warn( 2025-12-04T09:54:40.9322107Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9322893Z b_grad = a.grad 2025-12-04T09:54:40.9323664Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9324433Z c_grad = a.grad 2025-12-04T09:54:40.9324601Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9324816Z Traceback (most recent call last): 2025-12-04T09:54:40.9325047Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9325277Z method(*args, **kwargs) 2025-12-04T09:54:40.9325462Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9325655Z stack.enter_context(ctx) 2025-12-04T09:54:40.9325819Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9326039Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9326208Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9326383Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9326498Z AttributeError: args 2025-12-04T09:54:40.9326560Z 2025-12-04T09:54:40.9326632Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9326921Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9327136Z 2025-12-04T09:54:40.9327222Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9327444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9327913Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9328337Z warnings.warn( 2025-12-04T09:54:40.9329103Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9329877Z b_grad = a.grad 2025-12-04T09:54:40.9330635Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9331431Z c_grad = a.grad 2025-12-04T09:54:40.9331597Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9331794Z Traceback (most recent call last): 2025-12-04T09:54:40.9332046Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9332279Z method(*args, **kwargs) 2025-12-04T09:54:40.9332461Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9332658Z stack.enter_context(ctx) 2025-12-04T09:54:40.9332827Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9333004Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9333172Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9333347Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9333464Z AttributeError: args 2025-12-04T09:54:40.9333526Z 2025-12-04T09:54:40.9333622Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9333907Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9334124Z 2025-12-04T09:54:40.9334213Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9334408Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9334875Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9335299Z warnings.warn( 2025-12-04T09:54:40.9336094Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9336895Z b_grad = a.grad 2025-12-04T09:54:40.9337659Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9338426Z c_grad = a.grad 2025-12-04T09:54:40.9338590Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9338787Z Traceback (most recent call last): 2025-12-04T09:54:40.9339016Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9339244Z method(*args, **kwargs) 2025-12-04T09:54:40.9339428Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9339623Z stack.enter_context(ctx) 2025-12-04T09:54:40.9339790Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9339967Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9340159Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9340335Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9340451Z AttributeError: args 2025-12-04T09:54:40.9340511Z 2025-12-04T09:54:40.9340583Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9340893Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9341108Z 2025-12-04T09:54:40.9341193Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9341386Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9341746Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9341784Z warnings.warn( 2025-12-04T09:54:40.9342521Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9342558Z b_grad = a.grad 2025-12-04T09:54:40.9343264Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9343302Z c_grad = a.grad 2025-12-04T09:54:40.9343431Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9343477Z Traceback (most recent call last): 2025-12-04T09:54:40.9343631Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9343673Z method(*args, **kwargs) 2025-12-04T09:54:40.9343790Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9343833Z stack.enter_context(ctx) 2025-12-04T09:54:40.9343932Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9343977Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9344072Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9344117Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9344156Z AttributeError: args 2025-12-04T09:54:40.9344160Z 2025-12-04T09:54:40.9344234Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9344413Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9344415Z 2025-12-04T09:54:40.9344501Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9344576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9344937Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9344985Z warnings.warn( 2025-12-04T09:54:40.9345708Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9345747Z b_grad = a.grad 2025-12-04T09:54:40.9346520Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9346562Z c_grad = a.grad 2025-12-04T09:54:40.9346677Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9346723Z Traceback (most recent call last): 2025-12-04T09:54:40.9346876Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9346915Z method(*args, **kwargs) 2025-12-04T09:54:40.9347033Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9347077Z stack.enter_context(ctx) 2025-12-04T09:54:40.9347178Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9347223Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9347315Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9347384Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9347425Z AttributeError: args 2025-12-04T09:54:40.9347427Z 2025-12-04T09:54:40.9347500Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9347678Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9347682Z 2025-12-04T09:54:40.9347767Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9347841Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9348200Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9348239Z warnings.warn( 2025-12-04T09:54:40.9348952Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9348991Z b_grad = a.grad 2025-12-04T09:54:40.9349714Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9349765Z c_grad = a.grad 2025-12-04T09:54:40.9349878Z _ TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel _ 2025-12-04T09:54:40.9349922Z Traceback (most recent call last): 2025-12-04T09:54:40.9350076Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3329, in wrapper 2025-12-04T09:54:40.9350115Z method(*args, **kwargs) 2025-12-04T09:54:40.9350234Z File "/var/lib/jenkins/pytorch/test/inductor/test_compiled_autograd.py", line 5058, in wrapped 2025-12-04T09:54:40.9350293Z stack.enter_context(ctx) 2025-12-04T09:54:40.9350393Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 492, in enter_context 2025-12-04T09:54:40.9350436Z result = _cm_type.__enter__(cm) 2025-12-04T09:54:40.9350533Z File "/opt/conda/envs/py_3.10/lib/python3.10/contextlib.py", line 133, in __enter__ 2025-12-04T09:54:40.9350577Z del self.args, self.kwds, self.func 2025-12-04T09:54:40.9350617Z AttributeError: args 2025-12-04T09:54:40.9350619Z 2025-12-04T09:54:40.9350692Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9350871Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9350873Z 2025-12-04T09:54:40.9350958Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9351035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:54:40.9351395Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance. 
2025-12-04T09:54:40.9351447Z warnings.warn( 2025-12-04T09:54:40.9352154Z /var/lib/jenkins/pytorch/test/test_autograd.py:7724: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 2025-12-04T09:54:40.9352190Z b_grad = a.grad 2025-12-04T09:54:40.9352893Z /var/lib/jenkins/pytorch/test/test_autograd.py:7731: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more information. (Triggered internally at /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:492.) 
2025-12-04T09:54:40.9352928Z c_grad = a.grad 2025-12-04T09:54:40.9353160Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_compiled_autograd/inductor.test_compiled_autograd-420ab70ad85293fc.xml - 2025-12-04T09:54:40.9353221Z =========================== short test summary info ============================ 2025-12-04T09:54:40.9353461Z FAILED [0.0016s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9353464Z 2025-12-04T09:54:40.9353538Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9353729Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9353733Z 2025-12-04T09:54:40.9353820Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9354030Z FAILED [0.0008s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9354032Z 2025-12-04T09:54:40.9354104Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9354279Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9354282Z 2025-12-04T09:54:40.9354375Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9354581Z FAILED [0.0008s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9354585Z 2025-12-04T09:54:40.9354656Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9354834Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py 
TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9354836Z 2025-12-04T09:54:40.9354917Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9355124Z FAILED [0.0007s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9355127Z 2025-12-04T09:54:40.9355196Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9355373Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9355375Z 2025-12-04T09:54:40.9355459Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9355677Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9355680Z 2025-12-04T09:54:40.9355748Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9355963Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9355965Z 2025-12-04T09:54:40.9356047Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9356256Z FAILED [0.0005s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9356257Z 2025-12-04T09:54:40.9356328Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9356504Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9356505Z 2025-12-04T09:54:40.9356590Z This message can be 
suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9356795Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9356797Z 2025-12-04T09:54:40.9356866Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9357043Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9357065Z 2025-12-04T09:54:40.9357150Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9357356Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9357371Z 2025-12-04T09:54:40.9357442Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9357616Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9357620Z 2025-12-04T09:54:40.9357701Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9357910Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9357914Z 2025-12-04T09:54:40.9357984Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9358175Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9358177Z 2025-12-04T09:54:40.9358261Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9358468Z FAILED [0.0004s] 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9358470Z 2025-12-04T09:54:40.9358538Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9358714Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9358716Z 2025-12-04T09:54:40.9358797Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9359006Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9359008Z 2025-12-04T09:54:40.9359078Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9359268Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9359270Z 2025-12-04T09:54:40.9359353Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9359558Z FAILED [0.0005s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9359560Z 2025-12-04T09:54:40.9359630Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9359804Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9359807Z 2025-12-04T09:54:40.9359891Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9360096Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 
2025-12-04T09:54:40.9360100Z 2025-12-04T09:54:40.9360172Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9360348Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9360353Z 2025-12-04T09:54:40.9360435Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9360643Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9360656Z 2025-12-04T09:54:40.9360725Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9360904Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9360905Z 2025-12-04T09:54:40.9360999Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9361208Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9361211Z 2025-12-04T09:54:40.9361280Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9361457Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9361459Z 2025-12-04T09:54:40.9361541Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9361760Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9361762Z 2025-12-04T09:54:40.9361833Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9362009Z 
PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9362011Z 2025-12-04T09:54:40.9362095Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9362299Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9362301Z 2025-12-04T09:54:40.9362371Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9362545Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9362548Z 2025-12-04T09:54:40.9362632Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9362837Z FAILED [0.0005s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9362853Z 2025-12-04T09:54:40.9362923Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9363097Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9363100Z 2025-12-04T09:54:40.9363182Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9363391Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9363394Z 2025-12-04T09:54:40.9363466Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9363644Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9363646Z 
2025-12-04T09:54:40.9363731Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9363937Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9363939Z 2025-12-04T09:54:40.9364008Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9364187Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9364188Z 2025-12-04T09:54:40.9364269Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9364491Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9364493Z 2025-12-04T09:54:40.9364563Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9364757Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9364758Z 2025-12-04T09:54:40.9364844Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9365052Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9365054Z 2025-12-04T09:54:40.9365127Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9365304Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9365321Z 2025-12-04T09:54:40.9365408Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9365616Z FAILED [0.0003s] 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9365619Z 2025-12-04T09:54:40.9365691Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9365868Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9365869Z 2025-12-04T09:54:40.9366011Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9366221Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9366224Z 2025-12-04T09:54:40.9366295Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9366479Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9366481Z 2025-12-04T09:54:40.9366584Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9366792Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9366795Z 2025-12-04T09:54:40.9366863Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9367040Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9367042Z 2025-12-04T09:54:40.9367124Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9367335Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 
2025-12-04T09:54:40.9367337Z 2025-12-04T09:54:40.9367407Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9367586Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9367588Z 2025-12-04T09:54:40.9367673Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9367880Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9367882Z 2025-12-04T09:54:40.9367955Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9368130Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9368147Z 2025-12-04T09:54:40.9368233Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9368455Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9368458Z 2025-12-04T09:54:40.9368531Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9368708Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9368709Z 2025-12-04T09:54:40.9368794Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9369001Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9369006Z 2025-12-04T09:54:40.9369089Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9369267Z 
PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9369270Z 2025-12-04T09:54:40.9369353Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9369563Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9369565Z 2025-12-04T09:54:40.9369635Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9369814Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9369816Z 2025-12-04T09:54:40.9369898Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9370108Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9370110Z 2025-12-04T09:54:40.9370180Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9370371Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9370373Z 2025-12-04T09:54:40.9370458Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9370666Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9370668Z 2025-12-04T09:54:40.9370739Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9370915Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9370919Z 
2025-12-04T09:54:40.9371006Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9371217Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9371220Z 2025-12-04T09:54:40.9371290Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9371466Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9371468Z 2025-12-04T09:54:40.9371553Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9371760Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9371775Z 2025-12-04T09:54:40.9371846Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9372025Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9372028Z 2025-12-04T09:54:40.9372121Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9372328Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9372330Z 2025-12-04T09:54:40.9372400Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9372578Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9372581Z 2025-12-04T09:54:40.9372663Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9372884Z FAILED [0.0004s] 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9372886Z 2025-12-04T09:54:40.9372956Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9373138Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9373140Z 2025-12-04T09:54:40.9373225Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9373433Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9373435Z 2025-12-04T09:54:40.9373508Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9373685Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9373689Z 2025-12-04T09:54:40.9373774Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9373981Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9373995Z 2025-12-04T09:54:40.9374067Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9374245Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9374247Z 2025-12-04T09:54:40.9374332Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9374539Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 
2025-12-04T09:54:40.9374545Z 2025-12-04T09:54:40.9374616Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9374797Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9374800Z 2025-12-04T09:54:40.9374884Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9375091Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9375093Z 2025-12-04T09:54:40.9375163Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9375342Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9375344Z 2025-12-04T09:54:40.9375426Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9375647Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9375649Z 2025-12-04T09:54:40.9375720Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9375910Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9375912Z 2025-12-04T09:54:40.9376041Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9376250Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9376252Z 2025-12-04T09:54:40.9376322Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9376543Z 
PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9376546Z 2025-12-04T09:54:40.9376632Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9376840Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9376843Z 2025-12-04T09:54:40.9376914Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9377092Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9377094Z 2025-12-04T09:54:40.9377180Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9377385Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9377388Z 2025-12-04T09:54:40.9377461Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9377639Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9377664Z 2025-12-04T09:54:40.9377748Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9377958Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9377960Z 2025-12-04T09:54:40.9378030Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9378209Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9378211Z 
2025-12-04T09:54:40.9378293Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9378504Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9378506Z 2025-12-04T09:54:40.9378575Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9378757Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9378759Z 2025-12-04T09:54:40.9378842Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9379050Z FAILED [0.0003s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9379052Z 2025-12-04T09:54:40.9379126Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9379321Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9379323Z 2025-12-04T09:54:40.9379407Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9379627Z FAILED [0.0004s] inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9379631Z 2025-12-04T09:54:40.9379703Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9379881Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9379882Z 2025-12-04T09:54:40.9379967Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9380174Z FAILED [0.0003s] 
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel - AttributeError: args 2025-12-04T09:54:40.9380176Z 2025-12-04T09:54:40.9380261Z To execute this test, run the following from the base repo dir: 2025-12-04T09:54:40.9380439Z PYTORCH_TEST_WITH_ROCM=1 python test/test_autograd.py TestAutogradWithCompiledAutograd.test_checkpointing_without_reentrant_dataparallel 2025-12-04T09:54:40.9380444Z 2025-12-04T09:54:40.9380527Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:54:40.9380591Z ========================= 49 failed, 1 passed in 7.91s ========================= 2025-12-04T09:54:40.9380593Z 2025-12-04T09:54:40.9380773Z FINISHED PRINTING LOG FILE of inductor/test_compiled_autograd 1/2 (test/test-reports/inductor.test_compiled_autograd_1.2_0fbf59f36039e870_.log) 2025-12-04T09:54:40.9380775Z 2025-12-04T09:54:40.9380902Z Finished inductor/test_compiled_autograd 1/2 ... [2025-12-04 09:54:40.888642][5636101.395045027], took 0.27min 2025-12-04T09:54:40.9381138Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:54:40.9799569Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:54:40.9811112Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T09:54:40.9811340Z Uploading artifacts took 0.00 seconds 2025-12-04T09:54:40.9811457Z inductor/test_compiled_autograd 1/2 failed! 2025-12-04T09:54:40.9815805Z Running dynamo/test_unspec 1/1 ... 
[2025-12-04 09:54:40.981187][5636101.487584617] 2025-12-04T09:54:40.9815995Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:54:40.9816911Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_unspec.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:54:40.981481] 2025-12-04T09:54:43.5691264Z 2025-12-04T09:54:43.5692566Z dynamo/test_unspec 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_unspec_1.1_8de5b5607705ff1e_.log 2025-12-04T09:54:43.5693506Z Running 0 items in this shard: 2025-12-04T09:54:43.5693765Z 2025-12-04T09:54:43.5694150Z Finished dynamo/test_unspec 1/1 ... [2025-12-04 09:54:43.568659][5636104.075060601], took 0.04min 2025-12-04T09:54:43.5698510Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:54:43.6602784Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:54:43.6618558Z Running dynamo/test_higher_order_ops 1/1 ... [2025-12-04 09:54:43.661464][5636104.167861578] 2025-12-04T09:54:43.6619278Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:54:43.6621529Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_higher_order_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:54:43.661799] 2025-12-04T09:54:50.3048227Z 2025-12-04T09:54:50.3050352Z dynamo/test_higher_order_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_higher_order_ops_1.1_91a088a094116a02_.log 2025-12-04T09:54:50.3051420Z Running 0 items in this shard: 2025-12-04T09:54:50.3051692Z 2025-12-04T09:54:50.3052116Z Finished dynamo/test_higher_order_ops 1/1 ... [2025-12-04 09:54:50.304416][5636110.810815282], took 0.11min 2025-12-04T09:54:50.3058759Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:54:50.3956741Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:54:50.3974315Z Running inductor/test_flex_attention 1/4 ... [2025-12-04 09:54:50.396960][5636110.903357553] 2025-12-04T09:54:50.3975191Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:54:50.3976835Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=1', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:54:50.397285] 2025-12-04T09:58:53.5673763Z 2025-12-04T09:58:53.5674762Z PRINTING LOG FILE of inductor/test_flex_attention 1/4 (test/test-reports/inductor.test_flex_attention_1.4_583be521806f48fb_.log) 2025-12-04T09:58:53.5699791Z Test results will be stored in test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8ea7c7770886d406.xml 2025-12-04T09:58:53.5700756Z ============================= test session starts ============================== 2025-12-04T09:58:53.5701531Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T09:58:53.5702213Z cachedir: .pytest_cache 2025-12-04T09:58:53.5702970Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T09:58:53.5703837Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T09:58:53.5704836Z configfile: pytest.ini 2025-12-04T09:58:53.5705598Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T09:58:53.5706552Z collecting ... 
collected 763 items 2025-12-04T09:58:53.5707027Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T09:58:53.5729855Z Running 50 items in this shard: test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, 
test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, 
test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda 2025-12-04T09:58:53.5752247Z 2025-12-04T09:58:53.5752754Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda PASSED [6.9882s] [ 2%] 2025-12-04T09:58:53.5753854Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7262s] [ 2%] 2025-12-04T09:58:53.5754936Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6440s] [ 2%] 2025-12-04T09:58:53.5756040Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8129s] [ 2%] 2025-12-04T09:58:53.5757172Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7262s] [ 2%] 2025-12-04T09:58:53.5758241Z 
inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4215s] [ 2%] 2025-12-04T09:58:53.5759384Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.0950s] [ 2%] 2025-12-04T09:58:53.5760441Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.3377s] [ 2%] 2025-12-04T09:58:53.5761494Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4796s] [ 2%] 2025-12-04T09:58:53.5762551Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.2016s] [ 2%] 2025-12-04T09:58:53.5763764Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4079s] [ 2%] 2025-12-04T09:58:53.5764880Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.3684s] [ 2%] 2025-12-04T09:58:53.5766008Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.3639s] [ 2%] 2025-12-04T09:58:53.5767080Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.3081s] [ 2%] 2025-12-04T09:58:53.5768162Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6781s] [ 2%] 2025-12-04T09:58:53.5769210Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6441s] [ 2%] 2025-12-04T09:58:53.5770293Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6391s] [ 2%] 2025-12-04T09:58:53.5799405Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8884s] [ 2%] 2025-12-04T09:58:53.5800567Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.5969s] [ 2%] 
2025-12-04T09:58:53.5801619Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6290s] [ 2%] 2025-12-04T09:58:53.5802778Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4493s] [ 2%] 2025-12-04T09:58:53.5803839Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.2069s] [ 2%] 2025-12-04T09:58:53.5804912Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7474s] [ 2%] 2025-12-04T09:58:53.5806049Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.2224s] [ 2%] 2025-12-04T09:58:53.5807106Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [5.0817s] [ 2%] 2025-12-04T09:58:53.5808169Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4318s] [ 2%] 2025-12-04T09:58:53.5809215Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6751s] [ 2%] 2025-12-04T09:58:53.5810298Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.3797s] [ 2%] 2025-12-04T09:58:53.5811383Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8819s] [ 2%] 2025-12-04T09:58:53.5812441Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.5386s] [ 2%] 2025-12-04T09:58:53.5813507Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4898s] [ 2%] 2025-12-04T09:58:53.5814567Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [6.0320s] [ 2%] 2025-12-04T09:58:53.5815714Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.9406s] [ 
2%] 2025-12-04T09:58:53.5816868Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4893s] [ 2%] 2025-12-04T09:58:53.5818018Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7407s] [ 2%] 2025-12-04T09:58:53.5819072Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4628s] [ 2%] 2025-12-04T09:58:53.5820147Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8807s] [ 2%] 2025-12-04T09:58:53.5821210Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7286s] [ 2%] 2025-12-04T09:58:53.5822280Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6135s] [ 2%] 2025-12-04T09:58:53.5823428Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [5.0149s] [ 2%] 2025-12-04T09:58:53.5824520Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.4073s] [ 2%] 2025-12-04T09:58:53.5825594Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8554s] [ 2%] 2025-12-04T09:58:53.5826745Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8514s] [ 2%] 2025-12-04T09:58:53.5827802Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6409s] [ 2%] 2025-12-04T09:58:53.5828875Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7760s] [ 2%] 2025-12-04T09:58:53.5829936Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.7026s] [ 2%] 2025-12-04T09:58:53.5831000Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8861s] 
[ 2%] 2025-12-04T09:58:53.5832114Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6108s] [ 2%] 2025-12-04T09:58:53.5833184Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.8224s] [ 2%] 2025-12-04T09:58:53.5834304Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [4.6622s] [ 2%] 2025-12-04T09:58:53.5834905Z 2025-12-04T09:58:53.5835108Z =================================== FAILURES =================================== 2025-12-04T09:58:53.5835744Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.5836442Z Traceback (most recent call last): 2025-12-04T09:58:53.5837228Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.5837988Z self.assertTrue( 2025-12-04T09:58:53.5838572Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.5839233Z raise self.failureException(msg) 2025-12-04T09:58:53.5839941Z AssertionError: False is not true : Log file /tmp/tmp18_m9leu/flex_attention_configs.json was not created 2025-12-04T09:58:53.5840482Z 2025-12-04T09:58:53.5840756Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.5841687Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.5842353Z 2025-12-04T09:58:53.5842674Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.5843378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.5843912Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.5844291Z unimplemented [] 2025-12-04T09:58:53.5844777Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.5847183Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.5849593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.5850193Z graph_break [] 2025-12-04T09:58:53.5850646Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.5852723Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.5854652Z current_size = base.storage().size() 2025-12-04T09:58:53.5855092Z Autotune Choices Stats: 2025-12-04T09:58:53.5857891Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.5860927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.5861871Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.5863017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.5865771Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5870029Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5874186Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5878546Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.5882740Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5887006Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5891168Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5895380Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5899576Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.5903718Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.5906376Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.5907082Z Autotune Choices Stats: 2025-12-04T09:58:53.5909902Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.5913333Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.5914750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.5916436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.5919622Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5923966Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.5928295Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5932560Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5937086Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5941471Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5945750Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5950086Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.5954441Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5958765Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5961430Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.5962265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.5962811Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.5963175Z unimplemented [] 2025-12-04T09:58:53.5963587Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.5964301Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.5966818Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.5968963Z graph_break [] 2025-12-04T09:58:53.5969392Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.5969921Z Autotune Choices Stats: 2025-12-04T09:58:53.5972695Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.5975696Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.5976742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.5977816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.5980750Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5984993Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.5989156Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.5993283Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.5997653Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6001821Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6006020Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6010156Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6014315Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6018506Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6021097Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.6021784Z Autotune Choices Stats: 2025-12-04T09:58:53.6024594Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.6028135Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6029547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6031214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6034364Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6038698Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6043030Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6093940Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6098563Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6103200Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6107842Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6112441Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6117062Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6121552Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6124571Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.6125588Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.6126390Z Traceback (most recent call last): 2025-12-04T09:58:53.6127536Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.6128491Z self.assertTrue( 2025-12-04T09:58:53.6129209Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.6129967Z raise self.failureException(msg) 2025-12-04T09:58:53.6130815Z AssertionError: False is not true : Log file /tmp/tmpfn7vxgza/flex_attention_configs.json was not created 2025-12-04T09:58:53.6131427Z 2025-12-04T09:58:53.6131776Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.6132934Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.6133654Z 2025-12-04T09:58:53.6134104Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.6135028Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T09:58:53.6135773Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6136392Z unimplemented [] 2025-12-04T09:58:53.6137065Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6139565Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.6142230Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6143028Z graph_break [] 2025-12-04T09:58:53.6143602Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.6145805Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.6147952Z current_size = base.storage().size() 2025-12-04T09:58:53.6148571Z Autotune Choices Stats: 2025-12-04T09:58:53.6151393Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.6154755Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6155907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6157231Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6160195Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6164620Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6169355Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6173771Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6178130Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6182586Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6187108Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6191552Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6195899Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6200284Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6203096Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.6203928Z Autotune Choices Stats: 2025-12-04T09:58:53.6206918Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.6210465Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6212102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6213897Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6217365Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6221958Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.6226592Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6231272Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6235742Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6240281Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6245013Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6249717Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6254247Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6258763Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6261690Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.6262738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6263434Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6264095Z unimplemented [] 2025-12-04T09:58:53.6264666Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6265522Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6268249Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.6270699Z graph_break [] 2025-12-04T09:58:53.6271281Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.6272033Z Autotune Choices Stats: 2025-12-04T09:58:53.6274209Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.6276296Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6277070Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6277902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6279733Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6282504Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6285307Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6288171Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6291004Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6293794Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6296642Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6299460Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6302196Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6304944Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6307015Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.6307601Z Autotune Choices Stats: 2025-12-04T09:58:53.6309496Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.6311787Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6312765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6313894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6316220Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6319151Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6321975Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6324862Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6327900Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6330906Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6333761Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6336678Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6339608Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6342483Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6344290Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.6344931Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6345423Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6345863Z unimplemented [] 2025-12-04T09:58:53.6346306Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6346933Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6348615Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.6350148Z graph_break [] 2025-12-04T09:58:53.6350526Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.6350938Z Autotune Choices Stats: 2025-12-04T09:58:53.6352777Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.6354793Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6355576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6356391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6358230Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6361087Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6363844Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6366700Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6456669Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.6460157Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6463338Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6466365Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6469222Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6472004Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6473729Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.6474285Z Autotune Choices Stats: 2025-12-04T09:58:53.6476203Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.6482860Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.6483818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6484953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6487517Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6490675Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6493711Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6496677Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6499535Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6502427Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6505243Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6508102Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6510900Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6513734Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:53.6515600Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices
2025-12-04T09:58:53.6516246Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:53.6516706Z Traceback (most recent call last):
2025-12-04T09:58:53.6517260Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:53.6517786Z self.assertTrue(
2025-12-04T09:58:53.6518238Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:53.6518688Z raise self.failureException(msg)
2025-12-04T09:58:53.6519195Z AssertionError: False is not true : Log file /tmp/tmp0m7dzmdb/flex_attention_configs.json was not created
2025-12-04T09:58:53.6519584Z
2025-12-04T09:58:53.6519794Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:53.6520422Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:53.6520906Z
2025-12-04T09:58:53.6521230Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:53.6521832Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:53.6522368Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:53.6522750Z unimplemented []
2025-12-04T09:58:53.6523129Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:53.6524762Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:53.6526475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:53.6526970Z graph_break []
2025-12-04T09:58:53.6527411Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:53.6528854Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.6530227Z current_size = base.storage().size() 2025-12-04T09:58:53.6530627Z Autotune Choices Stats: 2025-12-04T09:58:53.6532510Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.6534675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6535388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6536482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6538364Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6541093Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6543997Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6546930Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6549713Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6552545Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6555348Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6558313Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6561187Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6564027Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6565790Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.6566461Z Autotune Choices Stats: 2025-12-04T09:58:53.6568493Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.6570797Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6571804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6572948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6575113Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6578253Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.6581083Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6584010Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6586951Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6589859Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6592708Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6595590Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6598478Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6601450Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6603339Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.6604025Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6604515Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6604901Z unimplemented [] 2025-12-04T09:58:53.6605357Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6605998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6607706Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.6609303Z graph_break [] 2025-12-04T09:58:53.6609702Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.6610137Z Autotune Choices Stats: 2025-12-04T09:58:53.6612199Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.6614267Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6615022Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6615905Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6617799Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6620528Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6623353Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6626292Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6629021Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6631775Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6634789Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6637628Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6640377Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6643219Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6645024Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.6645581Z Autotune Choices Stats: 2025-12-04T09:58:53.6647533Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.6650064Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6651069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6652245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6654385Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6657292Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6660170Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6663077Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6666081Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6668945Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6672131Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6675081Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6677980Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6680874Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6682686Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.6683367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6683892Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6684291Z unimplemented [] 2025-12-04T09:58:53.6684634Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6685167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6686944Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.6688401Z graph_break [] 2025-12-04T09:58:53.6688867Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.6689306Z Autotune Choices Stats: 2025-12-04T09:58:53.6691175Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.6693239Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6693952Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6694742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6696657Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6699498Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6702358Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6703988Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6705488Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.6706994Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6708265Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6709572Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6710901Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6712185Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6712989Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.6728741Z Autotune Choices Stats: 2025-12-04T09:58:53.6729647Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.6730733Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.6731195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6731726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6732719Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6734061Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6735397Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6736756Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6738080Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6900043Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6901612Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6902917Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6904206Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6905529Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6906371Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.6906627Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6906791Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6906910Z unimplemented [] 2025-12-04T09:58:53.6907038Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6907243Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6907959Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.6908654Z graph_break [] 2025-12-04T09:58:53.6908794Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.6908954Z Autotune Choices Stats: 2025-12-04T09:58:53.6909790Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.6910689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6910992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6911313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6912132Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6913387Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6914654Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6915895Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6917174Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6918453Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6919697Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6920936Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6922172Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6923429Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6924201Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.6924413Z Autotune Choices Stats: 2025-12-04T09:58:53.6925243Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.6926299Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6926748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6927239Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6928200Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6929482Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6930772Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6932072Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6933355Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6934654Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.6935996Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6937293Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6938580Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6939852Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6940659Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.6940926Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.6941108Z Traceback (most recent call last): 2025-12-04T09:58:53.6941343Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.6941572Z self.assertTrue( 2025-12-04T09:58:53.6941753Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.6941946Z raise self.failureException(msg) 2025-12-04T09:58:53.6942161Z AssertionError: False is not true : Log file /tmp/tmpkot2t0ts/flex_attention_configs.json was not created 2025-12-04T09:58:53.6942328Z 2025-12-04T09:58:53.6942406Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.6942684Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.6942888Z 2025-12-04T09:58:53.6942982Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.6943187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6943349Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6943464Z unimplemented [] 2025-12-04T09:58:53.6943591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6944296Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.6945010Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6945188Z graph_break [] 2025-12-04T09:58:53.6945323Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.6945996Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.6946565Z current_size = base.storage().size() 2025-12-04T09:58:53.6946693Z Autotune Choices Stats: 2025-12-04T09:58:53.6947507Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.6948405Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6948686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6949021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6949838Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6951081Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6952325Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6953591Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6954844Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6956121Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6957355Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6958613Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6959863Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6961097Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6961875Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.6962083Z Autotune Choices Stats: 2025-12-04T09:58:53.6962922Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.6963945Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6964376Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6964857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6965799Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6967117Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.6968429Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6969710Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6971022Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6972320Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6973599Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6974880Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6976840Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6978128Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6978922Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.6979170Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.6979330Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.6979448Z unimplemented [] 2025-12-04T09:58:53.6979579Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.6979778Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.6980543Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.6981187Z graph_break [] 2025-12-04T09:58:53.6981322Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.6981478Z Autotune Choices Stats: 2025-12-04T09:58:53.6982300Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.6983208Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6983490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.6983802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.6984614Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6985875Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6987479Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6988714Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6989987Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6991242Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.6992480Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6993723Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6994977Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6996357Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.6997141Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.6997350Z Autotune Choices Stats: 2025-12-04T09:58:53.6998199Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.6999223Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.6999646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7000144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7001090Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7002362Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7003651Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7004929Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7006330Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7007649Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7008947Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7010230Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7011509Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7012799Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7013590Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.7013837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7013998Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7014115Z unimplemented [] 2025-12-04T09:58:53.7014240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7014442Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7015160Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7015822Z graph_break [] 2025-12-04T09:58:53.7015996Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7016154Z Autotune Choices Stats: 2025-12-04T09:58:53.7016976Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.7017881Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7018176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7018490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7019312Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7020554Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7021804Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7023050Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7024292Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.7025549Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7026837Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7028085Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7029318Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7030576Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7031351Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.7031560Z Autotune Choices Stats: 2025-12-04T09:58:53.7032392Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7033400Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.7033855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7034340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7035304Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7036637Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7037926Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7039220Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7040503Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7041804Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7043115Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7044411Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7045712Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7047046Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7047860Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.7048108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7048268Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7048385Z unimplemented [] 2025-12-04T09:58:53.7048514Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7048714Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7049428Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7050075Z graph_break [] 2025-12-04T09:58:53.7050215Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7050373Z Autotune Choices Stats: 2025-12-04T09:58:53.7051194Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.7052113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7052393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7052710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7053534Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7054783Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7056053Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7057307Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7058545Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7059791Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7061059Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7062314Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7063552Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7064800Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7065587Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.7065795Z Autotune Choices Stats: 2025-12-04T09:58:53.7066657Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7067665Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7068086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7068575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7069559Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7070860Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7072145Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7073429Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7074729Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7076052Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7077342Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7078646Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7079947Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7081236Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7082026Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.7082273Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7082435Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7082551Z unimplemented [] 2025-12-04T09:58:53.7082694Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7082895Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7083605Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7084242Z graph_break [] 2025-12-04T09:58:53.7084378Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7084536Z Autotune Choices Stats: 2025-12-04T09:58:53.7085341Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.7086276Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7086577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7086906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7087721Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7088981Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.7090223Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7091468Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7092715Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7093959Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7095212Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7096513Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7097767Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7099006Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7099784Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.7099993Z Autotune Choices Stats: 2025-12-04T09:58:53.7100818Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.7101832Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7102253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7102734Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7103686Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7104998Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7106325Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7107614Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7108901Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7110209Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7111501Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7112789Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7114110Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7115406Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7116227Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.7116489Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.7116668Z Traceback (most recent call last): 2025-12-04T09:58:53.7116900Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.7117129Z self.assertTrue( 2025-12-04T09:58:53.7117301Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.7117494Z raise self.failureException(msg) 2025-12-04T09:58:53.7117710Z AssertionError: False is not true : Log file /tmp/tmppih0duwj/flex_attention_configs.json was not created 2025-12-04T09:58:53.7117879Z 2025-12-04T09:58:53.7117978Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.7118257Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.7118455Z 2025-12-04T09:58:53.7118551Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.7118755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7118915Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7119028Z unimplemented [] 2025-12-04T09:58:53.7119155Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7119832Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), 
('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.7120542Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7120720Z graph_break [] 2025-12-04T09:58:53.7120856Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7121455Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.7122037Z current_size = base.storage().size() 2025-12-04T09:58:53.7122164Z Autotune Choices Stats: 2025-12-04T09:58:53.7122985Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.7123887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7124182Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7124502Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7125311Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7126587Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7127841Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7129090Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7130338Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7131607Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7132861Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7134106Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7135340Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7136630Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7137412Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.7137622Z Autotune Choices Stats: 2025-12-04T09:58:53.7138441Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7139447Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7139896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7140380Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7141354Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7142634Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7143913Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7145209Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7146550Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7147833Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7149152Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7150445Z triton_flex_attention_backward_45 0.0221 ms 70.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7151732Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7153015Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7153819Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.7154064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7154223Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:53.7154339Z unimplemented [] 2025-12-04T09:58:53.7154465Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7154667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7155384Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7156068Z graph_break [] 2025-12-04T09:58:53.7156205Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7156363Z Autotune Choices Stats: 2025-12-04T09:58:53.7157185Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.7158096Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7158383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7158702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7159530Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7160780Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7162023Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7163269Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7164512Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7165756Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7167050Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7168301Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7169549Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7170788Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7171582Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 
0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.7171792Z Autotune Choices Stats: 2025-12-04T09:58:53.7172622Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.7173639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7174066Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7174551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7175529Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7176869Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7178162Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7179440Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7180752Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7182045Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7183334Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7184650Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7185981Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7187271Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7188078Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.7188325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7188493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7188611Z unimplemented [] 2025-12-04T09:58:53.7188735Z stats [('calls_captured', 105), 
('unique_graphs', 2)] 2025-12-04T09:58:53.7188953Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7189675Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7190319Z graph_break [] 2025-12-04T09:58:53.7190461Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7190618Z Autotune Choices Stats: 2025-12-04T09:58:53.7191428Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.7192337Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7192638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7192955Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7193786Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7195050Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7196323Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7197560Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7198829Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7200074Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7201315Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7202581Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7203827Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7205071Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7205850Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.7206099Z Autotune Choices Stats: 
2025-12-04T09:58:53.7206932Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7207967Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7208391Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7208874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7209829Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7211148Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7212452Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7213739Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7215026Z triton_flex_attention_backward_134 0.0202 ms 77.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7216362Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7217661Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7218960Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7220274Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7221572Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7222362Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.7222602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7222756Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7222866Z unimplemented [] 2025-12-04T09:58:53.7222987Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7223182Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7223895Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7224553Z graph_break [] 2025-12-04T09:58:53.7224684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7224837Z Autotune Choices Stats: 2025-12-04T09:58:53.7225640Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.7226569Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7226846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7227158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.7227995Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7229255Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7230506Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7231749Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7232995Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7234247Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7235488Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.7236772Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7238093Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7239351Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7240126Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.7240341Z Autotune Choices Stats: 2025-12-04T09:58:53.7241164Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7242196Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7242617Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7243098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7244047Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.7245341Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7246709Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7248007Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7249298Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7250600Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7251894Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7253184Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7254484Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7258736Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7259531Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.7259792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7259956Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7260075Z unimplemented [] 2025-12-04T09:58:53.7260204Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7260429Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7261145Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7261789Z graph_break [] 2025-12-04T09:58:53.7261927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7262084Z Autotune Choices Stats: 2025-12-04T09:58:53.7262899Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.7263817Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7264102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7264421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7265244Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7266547Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7267859Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7269116Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7270362Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7271610Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7272875Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7274120Z triton_flex_attention_204 0.0140 ms 66.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7275381Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7276671Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7277447Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.7277674Z Autotune Choices Stats: 2025-12-04T09:58:53.7278498Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.7279509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7279937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7280446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7281400Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7282685Z 
triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7283972Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7285264Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7286624Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7287920Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7289225Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7290525Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7291811Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7293126Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7293937Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.7294185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7294348Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7294464Z unimplemented [] 2025-12-04T09:58:53.7294591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7294793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7295517Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7296204Z graph_break [] 2025-12-04T09:58:53.7296343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7296501Z Autotune Choices Stats: 2025-12-04T09:58:53.7297306Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.7298235Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7298522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7298841Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7299658Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7300908Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7302189Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7303456Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7304717Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7305998Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7307256Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7308520Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7309763Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7311026Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7311821Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.7312036Z Autotune Choices Stats: 2025-12-04T09:58:53.7312884Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.7313893Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7314324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7314816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7315778Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7317116Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7318402Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7319719Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7321014Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7322312Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7323616Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7324913Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7326254Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7327549Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7328342Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.7328609Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.7328796Z Traceback (most recent call last): 2025-12-04T09:58:53.7329030Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.7329290Z self.assertTrue( 2025-12-04T09:58:53.7329464Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.7329657Z raise self.failureException(msg) 2025-12-04T09:58:53.7329877Z AssertionError: False is not true : Log 
file /tmp/tmpho6dwhch/flex_attention_configs.json was not created 2025-12-04T09:58:53.7330042Z 2025-12-04T09:58:53.7330126Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.7330410Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.7330609Z 2025-12-04T09:58:53.7330721Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.7330932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7331100Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7331221Z unimplemented [] 2025-12-04T09:58:53.7331350Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7332027Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.7332742Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7332925Z graph_break [] 2025-12-04T09:58:53.7333068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7333675Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.7334261Z current_size = base.storage().size() 2025-12-04T09:58:53.7334393Z Autotune Choices Stats: 2025-12-04T09:58:53.7335214Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.7336160Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7336445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7336769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7337610Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7338880Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7340138Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7341386Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7342626Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7343891Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7345139Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7346418Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7347699Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7348964Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7349742Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.7349956Z Autotune Choices Stats: 2025-12-04T09:58:53.7350781Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.7351810Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7352235Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7352725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7353678Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7354963Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7356299Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7357623Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7358908Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7360209Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7361528Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7362813Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7364101Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7365413Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7366251Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.7366517Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7366683Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7366803Z unimplemented [] 2025-12-04T09:58:53.7366935Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7367142Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7367859Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7368504Z graph_break [] 2025-12-04T09:58:53.7368645Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.7368807Z Autotune Choices Stats: 2025-12-04T09:58:53.7369611Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.7370528Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7370813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7371136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7371957Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7373223Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7374477Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7375739Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7377028Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7378270Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7379532Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7380779Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7382039Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7383298Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7384075Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.7384299Z Autotune Choices Stats: 2025-12-04T09:58:53.7385130Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.7386180Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7386604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7387106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7388061Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7389356Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7390645Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7391938Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7393255Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7394547Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7395841Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7397178Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7398469Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7399785Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7400599Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.7400850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7401016Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7401139Z unimplemented [] 2025-12-04T09:58:53.7401268Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7401474Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7402209Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7402862Z graph_break [] 2025-12-04T09:58:53.7403007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7403168Z Autotune Choices Stats: 2025-12-04T09:58:53.7403982Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.7404898Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7405187Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7405507Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7406367Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7407619Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7408875Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7410140Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7411447Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.7412705Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7413956Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7415223Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7416515Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7417775Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7418570Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.7418786Z Autotune Choices Stats: 2025-12-04T09:58:53.7419626Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7420637Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.7421066Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7421552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7422508Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7423825Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7425119Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7426459Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7427761Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7429061Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7430356Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7431648Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7460053Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7461473Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7462286Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.7462543Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7462710Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7462830Z unimplemented [] 2025-12-04T09:58:53.7462959Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7463242Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7463988Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7464636Z graph_break [] 2025-12-04T09:58:53.7464776Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7464955Z Autotune Choices Stats: 2025-12-04T09:58:53.7465782Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.7466738Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7467028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7467351Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7468193Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7469450Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7470706Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7471994Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7473282Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7474542Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7475800Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7477114Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7478374Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7479625Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7480405Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.7480619Z Autotune Choices Stats: 2025-12-04T09:58:53.7481474Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7482510Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7482963Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7483455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7484418Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7485721Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7487060Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7488350Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7489671Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7490979Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7492278Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7493576Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7494873Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7496200Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7497000Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.7497248Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7497409Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7497525Z unimplemented [] 2025-12-04T09:58:53.7497649Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7497852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7498593Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7499257Z graph_break [] 2025-12-04T09:58:53.7499391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7499549Z Autotune Choices Stats: 2025-12-04T09:58:53.7500376Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.7501282Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7501566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7501889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7502708Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7503980Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.7505240Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7506518Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7507785Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7509063Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7510317Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7511567Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7512834Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7514077Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7514851Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.7515061Z Autotune Choices Stats: 2025-12-04T09:58:53.7515905Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.7516978Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7517425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7517912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7518872Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7520175Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7521477Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7522780Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7524076Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7525387Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7526750Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7528044Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7529341Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7530645Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7531443Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.7531690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7531853Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7531972Z unimplemented [] 2025-12-04T09:58:53.7532096Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7532300Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7533018Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7533672Z graph_break [] 2025-12-04T09:58:53.7533805Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7533962Z Autotune Choices Stats: 2025-12-04T09:58:53.7534803Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.7535722Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7536052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7536370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7537187Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7538448Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7539704Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7540951Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7542212Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7543484Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7544762Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7546042Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7547296Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7548555Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7549332Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.7549540Z Autotune Choices Stats: 2025-12-04T09:58:53.7550367Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.7551393Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7551561Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7551876Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7552540Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7553170Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7553799Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7554445Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7555075Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7555707Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7556396Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7557055Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7557685Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7558315Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7558466Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.7558543Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7558592Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7558632Z unimplemented [] 2025-12-04T09:58:53.7558699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7558803Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7559381Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7559425Z graph_break [] 2025-12-04T09:58:53.7559501Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7559547Z Autotune Choices Stats: 2025-12-04T09:58:53.7560303Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.7560448Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7560577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7560747Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7561369Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7561977Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7562583Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7563204Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7563815Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7564423Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7565044Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7565677Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7566317Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7566925Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7567092Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.7567134Z Autotune Choices Stats: 2025-12-04T09:58:53.7567902Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.7568128Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7568295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7568582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7569243Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7569904Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7570535Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7571167Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7571815Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7572444Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7573073Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7573717Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7574369Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7574997Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:53.7575133Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices
2025-12-04T09:58:53.7575232Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:53.7575282Z Traceback (most recent call last):
2025-12-04T09:58:53.7575440Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:53.7575492Z self.assertTrue(
2025-12-04T09:58:53.7575601Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:53.7575652Z raise self.failureException(msg)
2025-12-04T09:58:53.7575786Z AssertionError: False is not true : Log file /tmp/tmpbx2xz6g6/flex_attention_configs.json was not created
2025-12-04T09:58:53.7575790Z
2025-12-04T09:58:53.7575869Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:53.7576075Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:53.7576078Z
2025-12-04T09:58:53.7576171Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:53.7576253Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:53.7576299Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:53.7576342Z unimplemented []
2025-12-04T09:58:53.7576406Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:53.7576996Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner',
3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.7577099Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7577142Z graph_break [] 2025-12-04T09:58:53.7577219Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7577732Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.7577800Z current_size = base.storage().size() 2025-12-04T09:58:53.7577842Z Autotune Choices Stats: 2025-12-04T09:58:53.7578615Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.7578750Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7578870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7579034Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7579648Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7580272Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7580886Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7581492Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7582109Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7582746Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7583355Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7583959Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7584572Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7585182Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7585316Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 
2025-12-04T09:58:53.7585360Z Autotune Choices Stats: 2025-12-04T09:58:53.7586166Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7586404Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7586586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7586869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7587530Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7588163Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7588785Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7589428Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7590067Z triton_flex_attention_backward_43 
0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7590706Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7591356Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7591989Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7592618Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7593255Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7593394Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.7593472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7593522Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7593562Z unimplemented [] 2025-12-04T09:58:53.7593627Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7593730Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7594312Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7594351Z graph_break [] 2025-12-04T09:58:53.7594430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7594474Z Autotune Choices Stats: 2025-12-04T09:58:53.7595225Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.7595369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7595495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7595660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.7596313Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7596927Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7597544Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7598154Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7598768Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7599385Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7600019Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.7600630Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7601235Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7601851Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7601987Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.7602029Z Autotune Choices Stats: 2025-12-04T09:58:53.7602802Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.7603028Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7603194Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7603495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7604146Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.7604774Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7605415Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7606120Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7606765Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7607397Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7608034Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7608692Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7609318Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7609954Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7610099Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.7610177Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7610223Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7610262Z unimplemented [] 2025-12-04T09:58:53.7610328Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7610430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7611013Z inductor [('triton_bundler_save_kernel', 
528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7611052Z graph_break [] 2025-12-04T09:58:53.7611130Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7611171Z Autotune Choices Stats: 2025-12-04T09:58:53.7611924Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.7612067Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7612192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7612359Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7612983Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7613588Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7614202Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7614815Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7615423Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7616073Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7616710Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7617337Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7617937Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7618550Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7618698Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.7618739Z Autotune Choices Stats: 2025-12-04T09:58:53.7619504Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7619728Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7619896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7620181Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7620836Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7621482Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7622113Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7622750Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7623391Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7624019Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7624646Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7625293Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7625975Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7626601Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7626733Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.7626808Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7626853Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7626891Z unimplemented [] 2025-12-04T09:58:53.7626955Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7627058Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7627648Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7627687Z graph_break [] 2025-12-04T09:58:53.7627762Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7627806Z Autotune Choices Stats: 2025-12-04T09:58:53.7628553Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.7628685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7628800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7628969Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7629604Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7630231Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7630839Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7631455Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7632083Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7632694Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7633305Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7633927Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7634553Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7635158Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7635292Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.7635334Z Autotune Choices Stats: 2025-12-04T09:58:53.7636148Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7636392Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7636560Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7636844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7637488Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7638135Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7638788Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7639411Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7640048Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7640692Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7641318Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7641951Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.7642592Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7643245Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7643379Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.7643460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7643505Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7643546Z unimplemented [] 2025-12-04T09:58:53.7643607Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7643708Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7644288Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7644340Z graph_break [] 2025-12-04T09:58:53.7644415Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7644459Z Autotune Choices Stats: 2025-12-04T09:58:53.7645207Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.7645338Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7645455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7645616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7646273Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7646899Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7647534Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7648136Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:53.7648741Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7649361Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7649966Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7650570Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7651184Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7651811Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7651945Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.7651988Z Autotune Choices Stats: 2025-12-04T09:58:53.7652751Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.7652973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7653152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7653435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7654073Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7654705Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7655339Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7656033Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7656666Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7657299Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7657942Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7658567Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7659201Z triton_flex_attention_backward_220 0.0227 ms 69.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7659847Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7659988Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.7660069Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7660127Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7660170Z unimplemented [] 2025-12-04T09:58:53.7660231Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7660336Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7660915Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7660956Z graph_break [] 2025-12-04T09:58:53.7661031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7661075Z Autotune Choices Stats: 2025-12-04T09:58:53.7661821Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.7661961Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7662079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7662245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7662863Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7663476Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7664103Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7664728Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7665336Z triton_flex_attention_244 0.0136 ms 71.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7665985Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7666607Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7667215Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7667823Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7668441Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7668585Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.7668630Z Autotune Choices Stats: 2025-12-04T09:58:53.7669405Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.7669629Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7669800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7670082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7670730Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7671367Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7671993Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7672637Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7673290Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7673917Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7674541Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7675182Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7675815Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7676480Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7676624Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.7676714Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7676758Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7676800Z unimplemented [] 2025-12-04T09:58:53.7676863Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7676966Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7677555Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7677598Z graph_break [] 2025-12-04T09:58:53.7677677Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7677719Z Autotune Choices Stats: 2025-12-04T09:58:53.7678463Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.7678593Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7678725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7678887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7679499Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7680110Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7680734Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7681342Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7681957Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7682574Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7683181Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7683797Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7684413Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7685022Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7685173Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.7685219Z Autotune Choices Stats: 2025-12-04T09:58:53.7686046Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.7686268Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7686437Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7686721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7687354Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7687998Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7688632Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7689264Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7689904Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7690561Z triton_flex_attention_backward_319 
0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7691194Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7691820Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7692459Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7693093Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7693224Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.7693303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7693351Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7693389Z unimplemented [] 2025-12-04T09:58:53.7693452Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7693556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7694154Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 
2025-12-04T09:58:53.7694193Z graph_break [] 2025-12-04T09:58:53.7694271Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7694312Z Autotune Choices Stats: 2025-12-04T09:58:53.7695072Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.7695206Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7695323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7695493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7696150Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7696780Z 
triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7697387Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7698011Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7698630Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7699249Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7699858Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7700462Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7701075Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7701680Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7701815Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.7701857Z Autotune Choices Stats: 2025-12-04T09:58:53.7702634Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", 
"best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.7702877Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7703057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7703341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7703976Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7704611Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.7705250Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7705876Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7706659Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7707300Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7707937Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7708569Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7709199Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7709838Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7709973Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.7710069Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.7710122Z Traceback (most recent call last): 2025-12-04T09:58:53.7710274Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.7710322Z self.assertTrue( 2025-12-04T09:58:53.7710426Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.7710478Z raise self.failureException(msg) 2025-12-04T09:58:53.7710607Z AssertionError: False is not true : Log file /tmp/tmpq51ab7fj/flex_attention_configs.json was not created 2025-12-04T09:58:53.7710610Z 2025-12-04T09:58:53.7710689Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.7710865Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.7710880Z 2025-12-04T09:58:53.7710976Z This message can be suppressed by setting 
PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.7711053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7711103Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7711142Z unimplemented [] 2025-12-04T09:58:53.7711208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7711806Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.7711913Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7711954Z graph_break [] 2025-12-04T09:58:53.7712029Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7712527Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.7712579Z current_size = base.storage().size() 2025-12-04T09:58:53.7712629Z Autotune Choices Stats: 2025-12-04T09:58:53.7713376Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.7713522Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7713640Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7713805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7714423Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7715038Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7715658Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7716311Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7716918Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7717520Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7718145Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7718747Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7719352Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7719971Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7720117Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.7720163Z Autotune Choices Stats: 2025-12-04T09:58:53.7720935Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.7721163Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7721331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7721615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7722270Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7722894Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7723517Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7724158Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7724813Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7725438Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7726104Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7726753Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7727382Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7728005Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7728159Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.7728258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7728302Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7728344Z unimplemented [] 2025-12-04T09:58:53.7728406Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7728511Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7729111Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7729155Z graph_break [] 2025-12-04T09:58:53.7729230Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.7729277Z Autotune Choices Stats: 2025-12-04T09:58:53.7730023Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.7730155Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7730284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7730447Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7731064Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7731667Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7732279Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7732893Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7733514Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7734118Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7734719Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7735340Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7735975Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7736586Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7736743Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.7736788Z Autotune Choices Stats: 2025-12-04T09:58:53.7737561Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.7737790Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7737962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7738245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7738878Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7739519Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7740147Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7740772Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7741413Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7742062Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7742684Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7743313Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7743950Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7744581Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7744713Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.7744792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7744835Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7744879Z unimplemented [] 2025-12-04T09:58:53.7744942Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7745046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7745646Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7745688Z graph_break [] 2025-12-04T09:58:53.7745764Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7745808Z Autotune Choices Stats: 2025-12-04T09:58:53.7746619Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.7746749Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7746869Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7747034Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7747647Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7748266Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7748877Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7749489Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7750114Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.7750732Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7751341Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7751947Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7752565Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7753178Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7753309Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.7753356Z Autotune Choices Stats: 2025-12-04T09:58:53.7754130Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7754365Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.7754545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7754827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7755476Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7756142Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7756790Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7757424Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7758096Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7758738Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7759381Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7760014Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7760645Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7761290Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7761423Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.7761502Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7761546Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7761590Z unimplemented [] 2025-12-04T09:58:53.7761653Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7761759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7762344Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7762407Z graph_break [] 2025-12-04T09:58:53.7762486Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7762527Z Autotune Choices Stats: 2025-12-04T09:58:53.7763284Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.7763414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7763534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7763702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7764321Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7764928Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7765548Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7766188Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7766815Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7767430Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7768047Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7768656Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7769262Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7769882Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7770014Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.7770061Z Autotune Choices Stats: 2025-12-04T09:58:53.7770826Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7771060Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7771243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7771522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7772167Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7772802Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7773432Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7774070Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7774705Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7775357Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7776027Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7776677Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7777313Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7777942Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7778092Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.7778170Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7778219Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7778261Z unimplemented [] 2025-12-04T09:58:53.7778331Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7778433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7779015Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7779055Z graph_break [] 2025-12-04T09:58:53.7779136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7779177Z Autotune Choices Stats: 2025-12-04T09:58:53.7779937Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.7780084Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7780200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7780380Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7780994Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7781607Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.7782231Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7782837Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7783443Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7784066Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7784696Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7785303Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7785907Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7786561Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7786711Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.7786754Z Autotune Choices Stats: 2025-12-04T09:58:53.7787517Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.7787740Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7787914Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7788215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7788869Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7789512Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7790144Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7790778Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7791419Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7792050Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7792694Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7793352Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7793984Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7794620Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7794756Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.7794846Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7794895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7794935Z unimplemented [] 2025-12-04T09:58:53.7795003Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7795105Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7795689Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7795731Z graph_break [] 2025-12-04T09:58:53.7795809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7795852Z Autotune Choices Stats: 2025-12-04T09:58:53.7796643Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.7796777Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7796925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7797093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7797725Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7798332Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7798943Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7799572Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7800181Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7800788Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7801406Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7802037Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7802642Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7803250Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7803399Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.7803444Z Autotune Choices Stats: 2025-12-04T09:58:53.7804215Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.7804439Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7804608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7804893Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7805542Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7806242Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7806869Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7807499Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7808129Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7808782Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7809410Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7810065Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7810720Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7811350Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7811486Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.7811563Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7811610Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7811652Z unimplemented [] 2025-12-04T09:58:53.7811719Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7811822Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7812416Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7812461Z graph_break [] 2025-12-04T09:58:53.7812537Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7812584Z Autotune Choices Stats: 2025-12-04T09:58:53.7813330Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.7813464Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7813581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7813749Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7814375Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7815009Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7815613Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7816252Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7816877Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7817489Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7818099Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7818726Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7819361Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7819967Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7820104Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.7820148Z Autotune Choices Stats: 2025-12-04T09:58:53.7820917Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.7821155Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7821323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7821612Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7822255Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7822901Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7823563Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7824196Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7824826Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7825472Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7826144Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7826778Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7827450Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7828106Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7828245Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.7828330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7828376Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7828420Z unimplemented [] 2025-12-04T09:58:53.7828483Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7828589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7829165Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7829223Z graph_break [] 2025-12-04T09:58:53.7829301Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7829346Z Autotune Choices Stats: 2025-12-04T09:58:53.7830095Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.7830227Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7830349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7830514Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7831131Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7831751Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7832378Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7832998Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7833611Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7834225Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.7834836Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7835448Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7836105Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7836737Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7836881Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.7836928Z Autotune Choices Stats: 2025-12-04T09:58:53.7837696Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.7837922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7838101Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7838386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.7839023Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7839655Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7840294Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7840962Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7841599Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7842234Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7842872Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7843506Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7844144Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7844788Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.7844930Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.7845013Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7845061Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7845118Z unimplemented [] 2025-12-04T09:58:53.7845182Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7845291Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7845873Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7845919Z graph_break [] 2025-12-04T09:58:53.7846022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7846070Z Autotune Choices Stats: 2025-12-04T09:58:53.7846821Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:53.7846968Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7847088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7847255Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7847874Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7848490Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7849115Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7849750Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7850369Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7850983Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7851597Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7852206Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7852822Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7853444Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7853587Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.7853633Z Autotune Choices Stats: 2025-12-04T09:58:53.7854410Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7854640Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7854818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7855101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7855758Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7856418Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7857053Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7857704Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7858369Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7859005Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7859640Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7860283Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7860920Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7861553Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7861696Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:53.7861809Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.7861859Z Traceback (most recent call last): 2025-12-04T09:58:53.7862019Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.7862062Z self.assertTrue( 2025-12-04T09:58:53.7862175Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.7862226Z raise self.failureException(msg) 2025-12-04T09:58:53.7862362Z AssertionError: False is not true : Log file /tmp/tmpuowzk5ja/flex_attention_configs.json was not created 2025-12-04T09:58:53.7862364Z 2025-12-04T09:58:53.7862451Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.7862623Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.7862627Z 2025-12-04T09:58:53.7862722Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.7862805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7862857Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7862899Z unimplemented [] 2025-12-04T09:58:53.7862963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7863553Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.7863661Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7863713Z graph_break [] 2025-12-04T09:58:53.7863795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7864287Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.7864343Z current_size = base.storage().size() 2025-12-04T09:58:53.7864386Z Autotune Choices Stats: 2025-12-04T09:58:53.7865140Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.7865276Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7865393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7865565Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7866232Z triton_flex_attention_6 0.0120 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7866868Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7867474Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7868085Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7868703Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7869310Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7869922Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7870542Z triton_flex_attention_10 
0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7871166Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7871773Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7871911Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.7871954Z Autotune Choices Stats: 2025-12-04T09:58:53.7872717Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7872960Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7873131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7873419Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7874055Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7874711Z 
triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7875365Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7876027Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7876658Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7877312Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7877945Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7878578Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7879218Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7879874Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7880012Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.7880091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7880148Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7880204Z unimplemented [] 2025-12-04T09:58:53.7880274Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7880377Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7880964Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7881020Z graph_break [] 2025-12-04T09:58:53.7881097Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7881144Z Autotune Choices Stats: 2025-12-04T09:58:53.7881890Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.7882024Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7882144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7882314Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7882930Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7883546Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7884184Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7884794Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7885401Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7886051Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7886656Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7887263Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7887881Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7888518Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7888657Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.7888702Z Autotune Choices Stats: 2025-12-04T09:58:53.7889468Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.7889696Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7889884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7890170Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7890812Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7891441Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7892084Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7892740Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7893374Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7894003Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7894645Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7895281Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7895913Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7896585Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7896734Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.7896813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7896863Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7896905Z unimplemented [] 2025-12-04T09:58:53.7896987Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7897091Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7897677Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7897723Z graph_break [] 2025-12-04T09:58:53.7897801Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7897848Z Autotune Choices Stats: 2025-12-04T09:58:53.7898596Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.7898741Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7898857Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7899027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7899650Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7900254Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7900874Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7901504Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7902120Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7902724Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7903339Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7903947Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7904552Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7905164Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7905312Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.7905357Z Autotune Choices Stats: 2025-12-04T09:58:53.7906181Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.7906412Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7906579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7906869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7907523Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7908154Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7908788Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7909434Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7910092Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7910725Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7911361Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7912008Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.7912644Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7913277Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7913415Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.7913523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7913569Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7913613Z unimplemented [] 2025-12-04T09:58:53.7913676Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7913783Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7914372Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7914418Z graph_break [] 2025-12-04T09:58:53.7914497Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7914548Z Autotune Choices Stats: 2025-12-04T09:58:53.7915299Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.7915433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7915565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7915734Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7916385Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7916995Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7917605Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7918237Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:53.7918896Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7919520Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7920124Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7920757Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7921365Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7921978Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7922128Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.7922196Z Autotune Choices Stats: 2025-12-04T09:58:53.7922966Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.7923193Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7923367Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7923651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7924300Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7924946Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7925578Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7926248Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7926909Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7927565Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7928193Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7928825Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7929470Z triton_flex_attention_backward_174 0.0227 ms 69.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7930103Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7930237Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.7930318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7930363Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7930408Z unimplemented [] 2025-12-04T09:58:53.7930473Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7930582Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7931174Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7931232Z graph_break [] 2025-12-04T09:58:53.7931309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7931357Z Autotune Choices Stats: 2025-12-04T09:58:53.7932119Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.7932250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7932371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7932538Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7933158Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7933780Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7934383Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7935038Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7935656Z triton_flex_attention_187 0.0128 ms 72.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7936322Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7936929Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7937545Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7938168Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7938777Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7938911Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.7938956Z Autotune Choices Stats: 2025-12-04T09:58:53.7939748Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.7939990Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7940176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7940460Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7941104Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7941743Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7942385Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7943021Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7943664Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7944308Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7944959Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7945586Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7946260Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7946917Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7947050Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.7947134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7947177Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7947220Z unimplemented [] 2025-12-04T09:58:53.7947282Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7947385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7947961Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7948004Z graph_break [] 2025-12-04T09:58:53.7948099Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7948154Z Autotune Choices Stats: 2025-12-04T09:58:53.7948913Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.7949042Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7949161Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7949325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7949937Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7950542Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7951160Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7951767Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7952385Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7953009Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7953624Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7954229Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7954838Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7955466Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7955596Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.7955642Z Autotune Choices Stats: 2025-12-04T09:58:53.7956450Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.7956670Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7956876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7957158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7957816Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7958445Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7959072Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7959718Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7960350Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7960993Z 
triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7961633Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7962273Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7962950Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7963602Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7963762Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.7963868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7964043Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7964103Z unimplemented [] 2025-12-04T09:58:53.7964202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7964316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7964919Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.7964994Z graph_break [] 2025-12-04T09:58:53.7965092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7965165Z Autotune Choices Stats: 2025-12-04T09:58:53.7965979Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.7966154Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7966305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7966507Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7967147Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7967769Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7968399Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7972353Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7972972Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7973599Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7974216Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7974840Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7975444Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7976086Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7976242Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.7976286Z Autotune Choices Stats: 2025-12-04T09:58:53.7977049Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.7977270Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7977440Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7977735Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7978380Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7979027Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7979651Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7980271Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7980914Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7981545Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7982184Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7982834Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7983478Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7984103Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7984237Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.7984328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.7984376Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.7984415Z unimplemented [] 2025-12-04T09:58:53.7984483Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.7984585Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.7985165Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.7985205Z graph_break [] 
2025-12-04T09:58:53.7985286Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.7985329Z Autotune Choices Stats: 2025-12-04T09:58:53.7986111Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.7986241Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7986374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7986553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7987173Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7987780Z triton_flex_attention_328 0.0120 ms 87.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7988386Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7989005Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7989606Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7990209Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7990822Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.7991452Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7992052Z triton_flex_attention_334 
0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7992655Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7992789Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.7992843Z Autotune Choices Stats: 2025-12-04T09:58:53.7993601Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 
2025-12-04T09:58:53.7993823Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.7993993Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.7994276Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.7994910Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7995540Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7996219Z 
triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7996840Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7997464Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7998104Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7998731Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.7999371Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8000017Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8000642Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8000777Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.8000854Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8000901Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8000941Z unimplemented [] 2025-12-04T09:58:53.8001009Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8001112Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8001688Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8001741Z graph_break [] 2025-12-04T09:58:53.8001819Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:53.8001862Z Autotune Choices Stats: 2025-12-04T09:58:53.8002606Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.8002739Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8002855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8003021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8003642Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8004260Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8004867Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8005477Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8006132Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8006735Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8007344Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8007975Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8008597Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8009198Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8009333Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.8009375Z Autotune Choices Stats: 2025-12-04T09:58:53.8010130Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8010364Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8010530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8010811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8011450Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8012082Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8012729Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8013353Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8013986Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8014624Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8015253Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8015881Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8016557Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8017203Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8017336Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:53.8017414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8017463Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8017503Z unimplemented [] 2025-12-04T09:58:53.8017569Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8017670Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8018245Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8018287Z graph_break [] 2025-12-04T09:58:53.8018376Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8018422Z Autotune Choices Stats: 2025-12-04T09:58:53.8019160Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.8019293Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8019408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8019573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8020187Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8020797Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8021420Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8022024Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8022630Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:53.8023247Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8023852Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8024454Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8025071Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8025693Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8025826Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.8025871Z Autotune Choices Stats: 2025-12-04T09:58:53.8026664Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.8026885Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8027070Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8027353Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8027986Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8028609Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8029251Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8029898Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8030526Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8031152Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8031787Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8032414Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8033036Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8033674Z triton_flex_attention_backward_450 
0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8033820Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.8033916Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.8033969Z Traceback (most recent call last): 2025-12-04T09:58:53.8034135Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.8034178Z self.assertTrue( 2025-12-04T09:58:53.8034287Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.8034339Z raise self.failureException(msg) 2025-12-04T09:58:53.8034472Z AssertionError: False is not true : Log file /tmp/tmpvkf0nme2/flex_attention_configs.json was not created 2025-12-04T09:58:53.8034475Z 2025-12-04T09:58:53.8034553Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.8034719Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.8034722Z 2025-12-04T09:58:53.8034812Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.8034892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8034938Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8034982Z unimplemented [] 2025-12-04T09:58:53.8035045Z 
stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8035629Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.8035742Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8035783Z graph_break [] 2025-12-04T09:58:53.8035860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8036390Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.8036444Z current_size = base.storage().size() 2025-12-04T09:58:53.8036486Z Autotune Choices Stats: 2025-12-04T09:58:53.8037229Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.8037358Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8037494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8037672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8038300Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8038906Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8039507Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8040123Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8040729Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8041329Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8041940Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8042564Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8043165Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8043766Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8043897Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.8043953Z Autotune Choices Stats: 2025-12-04T09:58:53.8044719Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.8044938Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8045109Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8045389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8046068Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8046709Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8047346Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8047965Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8048591Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8049242Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8049865Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8050504Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8051153Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8051775Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8051908Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.8051983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8052030Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8052069Z unimplemented [] 2025-12-04T09:58:53.8052137Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8052237Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8052813Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8052863Z graph_break [] 2025-12-04T09:58:53.8052941Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.8052984Z Autotune Choices Stats: 2025-12-04T09:58:53.8053729Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.8053862Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8053976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8054142Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8054756Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8055384Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8056011Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8056608Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8057231Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8057835Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8058437Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8059048Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8059673Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8060276Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8060410Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.8060453Z Autotune Choices Stats: 2025-12-04T09:58:53.8061214Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8061442Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8061612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8061894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8062525Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8063159Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8063801Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8064427Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8065052Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8065676Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8066350Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8066977Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8067619Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8068268Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8068400Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.8068479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8068526Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8068566Z unimplemented [] 2025-12-04T09:58:53.8068633Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8068734Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8069309Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8069349Z graph_break [] 2025-12-04T09:58:53.8069444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8069487Z Autotune Choices Stats: 2025-12-04T09:58:53.8070225Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.8070355Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8070472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8070638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8071246Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8071861Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8072484Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8073088Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8073691Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.8074305Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8074905Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8075514Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8076167Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8076797Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8076929Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.8076973Z Autotune Choices Stats: 2025-12-04T09:58:53.8077738Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8077955Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.8078134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8078414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8079041Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8079671Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8080303Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8080950Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8081578Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8082213Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8082844Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8083469Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8084096Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8084733Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8084879Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.8084954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8084999Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8085038Z unimplemented [] 2025-12-04T09:58:53.8085103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8085215Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8085788Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8085830Z graph_break [] 2025-12-04T09:58:53.8085905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8085987Z Autotune Choices Stats: 2025-12-04T09:58:53.8086725Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.8086886Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8086999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8087162Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8087773Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8088375Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8088989Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8089626Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8090229Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8090833Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8091442Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8092044Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8092653Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8093262Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8093404Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.8093446Z Autotune Choices Stats: 2025-12-04T09:58:53.8094209Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8094429Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8094597Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8094881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8095519Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8096178Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8096802Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8097446Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8098108Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8098731Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8099361Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8100000Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8100624Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8101246Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8101381Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.8101468Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8101529Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8101569Z unimplemented [] 2025-12-04T09:58:53.8101635Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8101735Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8102328Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8102372Z graph_break [] 2025-12-04T09:58:53.8102448Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8102497Z Autotune Choices Stats: 2025-12-04T09:58:53.8103235Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.8103366Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8103485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8103660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8104271Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8104873Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.8105482Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8106150Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8106776Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8107376Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8107985Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8108603Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8109207Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8109809Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8109942Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.8109999Z Autotune Choices Stats: 2025-12-04T09:58:53.8110769Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.8110999Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8111167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8111448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8112083Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8112724Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8113352Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8113977Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8114617Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8115262Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8115891Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8116558Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8117203Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8117827Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8117960Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.8118041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8118085Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8118129Z unimplemented [] 2025-12-04T09:58:53.8118191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8118294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8118882Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8118947Z graph_break [] 2025-12-04T09:58:53.8119023Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8119068Z Autotune Choices Stats: 2025-12-04T09:58:53.8119821Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.8119953Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8120071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8120233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8120848Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8121468Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8122071Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8122677Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8123290Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8123917Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8124513Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8125114Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8125733Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8126366Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8126497Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.8126543Z Autotune Choices Stats: 2025-12-04T09:58:53.8127317Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.8127551Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8127725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8128018Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8128647Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8129267Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8129902Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8130525Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8131156Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8131791Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8132437Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8133063Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8133687Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8134325Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8134455Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.8134536Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8134581Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8134625Z unimplemented [] 2025-12-04T09:58:53.8134687Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8134792Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8135365Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8135407Z graph_break [] 2025-12-04T09:58:53.8135483Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8135554Z Autotune Choices Stats: 2025-12-04T09:58:53.8136342Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.8136471Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8136588Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8136751Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8137362Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8137962Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8138582Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8139184Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8139800Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8140419Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8141035Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8141639Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8142241Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8142857Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8142989Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.8143037Z Autotune Choices Stats: 2025-12-04T09:58:53.8143787Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.8144010Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8144189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8144474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8145118Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8145742Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8146396Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8147031Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8147661Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8148287Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8148923Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8149578Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8150206Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8150829Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8150977Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.8151055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8151099Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8151140Z unimplemented [] 2025-12-04T09:58:53.8151202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8151308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8151886Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8151927Z graph_break [] 2025-12-04T09:58:53.8152007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8152049Z Autotune Choices Stats: 2025-12-04T09:58:53.8152796Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.8152934Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8153050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8153214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8153838Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8154443Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8155046Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8155659Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8156332Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8156948Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8157560Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8158172Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8158779Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8159380Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8159522Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.8159567Z Autotune Choices Stats: 2025-12-04T09:58:53.8160320Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.8160541Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8160709Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8160988Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.8161631Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8162277Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8162901Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8163520Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8164157Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8164786Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8165419Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8166089Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8166738Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8167370Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8167502Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.8167578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8167647Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8167687Z unimplemented [] 2025-12-04T09:58:53.8167753Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8167854Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8168425Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8168464Z graph_break [] 2025-12-04T09:58:53.8168540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8168585Z Autotune Choices Stats: 2025-12-04T09:58:53.8169331Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:53.8169462Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8169601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8169780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8170399Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8170997Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8171604Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8172206Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8172822Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8173426Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8174040Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8174648Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8175262Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8175867Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8176035Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.8176097Z Autotune Choices Stats: 2025-12-04T09:58:53.8176855Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8177073Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8177243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8177524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8178167Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8178802Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8179438Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8180061Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8180687Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8181321Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8181948Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8182584Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8183222Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8183856Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8183991Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:53.8184067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8184112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8184152Z unimplemented [] 2025-12-04T09:58:53.8184218Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8184319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8184900Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8184951Z graph_break [] 2025-12-04T09:58:53.8185029Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8185071Z Autotune Choices Stats: 2025-12-04T09:58:53.8185805Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.8185969Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:53.8186084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8186248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8186882Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8187509Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8188113Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8188715Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8189332Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8189936Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8190540Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8191154Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8191777Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8192380Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8192514Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.8192556Z Autotune Choices Stats: 2025-12-04T09:58:53.8193316Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.8193553Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8193719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8193999Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8194629Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8195265Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8195898Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8196583Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8197209Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8197836Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8198471Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8199100Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8199748Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8200399Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8200530Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.8200608Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:53.8200656Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8200695Z unimplemented [] 2025-12-04T09:58:53.8200761Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8200860Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8201433Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8201477Z graph_break [] 2025-12-04T09:58:53.8201551Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8201610Z Autotune Choices Stats: 2025-12-04T09:58:53.8202343Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.8202474Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8202589Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8202753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8203369Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8203980Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8204609Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.8205211Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8205811Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8206462Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8207069Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8207673Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8208290Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8208919Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8209053Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.8209096Z Autotune Choices Stats: 2025-12-04T09:58:53.8209850Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.8210070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8210236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8210529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8211170Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8211790Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8212430Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8213078Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8213705Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8214329Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8214967Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8215598Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8216264Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8216901Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8217047Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.8217144Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.8217195Z Traceback (most recent 
call last):
2025-12-04T09:58:53.8217350Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:53.8217407Z self.assertTrue(
2025-12-04T09:58:53.8217516Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:53.8217567Z raise self.failureException(msg)
2025-12-04T09:58:53.8217701Z AssertionError: False is not true : Log file /tmp/tmpm2zzky_x/flex_attention_configs.json was not created
2025-12-04T09:58:53.8217704Z
2025-12-04T09:58:53.8217780Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:53.8217948Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:53.8217951Z
2025-12-04T09:58:53.8218043Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:53.8218121Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:53.8218165Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:53.8218207Z unimplemented []
2025-12-04T09:58:53.8218271Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:53.8218847Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:53.8218962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:53.8219003Z graph_break []
2025-12-04T09:58:53.8219077Z ----------------------------- Captured stderr call
----------------------------- 2025-12-04T09:58:53.8219563Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.8219617Z current_size = base.storage().size() 2025-12-04T09:58:53.8219660Z Autotune Choices Stats: 2025-12-04T09:58:53.8220405Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.8220534Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8220663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8220836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8221456Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8222063Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8222667Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8223267Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8223887Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8224492Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8225103Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8225710Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8226369Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8226972Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8227103Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.8227164Z Autotune Choices Stats: 2025-12-04T09:58:53.8227919Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8228136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8228309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8228589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8229228Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8229864Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8230504Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8231129Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8231757Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8232393Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8233023Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8233657Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8234294Z 
triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8234925Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8235057Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.8235136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8235185Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8235225Z unimplemented [] 2025-12-04T09:58:53.8235288Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8235393Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8236012Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8236072Z graph_break [] 2025-12-04T09:58:53.8236152Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8236195Z Autotune Choices Stats: 2025-12-04T09:58:53.8236932Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.8237065Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8237181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8237345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8237962Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8238592Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8239192Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8239795Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8240412Z triton_flex_attention_68 
0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8241015Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8241621Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8242233Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8242844Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8243457Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8243593Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.8243636Z Autotune Choices Stats: 2025-12-04T09:58:53.8244396Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8244625Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8244793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8245075Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8245702Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8246386Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8247026Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8247656Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8248286Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8248913Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8249547Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8250170Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8250805Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8251454Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8251589Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.8251665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8251715Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8251755Z unimplemented [] 2025-12-04T09:58:53.8251821Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8251921Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8252506Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8252546Z graph_break [] 2025-12-04T09:58:53.8252625Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8252680Z Autotune Choices Stats: 2025-12-04T09:58:53.8253418Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.8253550Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8253665Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8253832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8254439Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8255054Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8255679Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8256306Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8256910Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8257532Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8258135Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8258737Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8259360Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8259988Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8260119Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.8260160Z Autotune Choices Stats: 2025-12-04T09:58:53.8260919Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8261141Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8261309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8261600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8262232Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8262856Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8263496Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8264143Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8264768Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8265395Z triton_flex_attention_backward_135 
0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8266060Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8266710Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8267333Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8267972Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8268117Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.8268193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8268241Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8268281Z unimplemented [] 2025-12-04T09:58:53.8268346Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8268459Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8269033Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 
2025-12-04T09:58:53.8269073Z graph_break [] 2025-12-04T09:58:53.8269151Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8269193Z Autotune Choices Stats: 2025-12-04T09:58:53.8269945Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.8270091Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8270207Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8270372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8270988Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8271589Z 
triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8272203Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8272826Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8273427Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8274031Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8274650Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8275254Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8275851Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8276508Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8276653Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.8276694Z Autotune Choices Stats: 2025-12-04T09:58:53.8277461Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", 
"best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8277680Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8277844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8278123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8278758Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8279399Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8280023Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8280654Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8281309Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8281930Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8282552Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8283193Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8283819Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8284437Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8284567Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.8284641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8284699Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8284749Z unimplemented [] 2025-12-04T09:58:53.8284814Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8284913Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8285503Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8285544Z graph_break [] 2025-12-04T09:58:53.8285618Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.8285662Z Autotune Choices Stats: 2025-12-04T09:58:53.8286420Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.8286551Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8286666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8286857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8287468Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8288074Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8288680Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8289293Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8289923Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8290527Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8291133Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8291750Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8292353Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8292954Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8293086Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.8293128Z Autotune Choices Stats: 2025-12-04T09:58:53.8293901Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.8294150Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8294319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8294601Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8295237Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8295870Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8296522Z triton_flex_attention_backward_216 0.0186 ms 
84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8297146Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8297788Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8298441Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8299070Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8299697Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8300340Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8300970Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8301105Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.8301184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8301228Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8301271Z unimplemented [] 2025-12-04T09:58:53.8301335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8301438Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8302020Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8302078Z graph_break [] 2025-12-04T09:58:53.8302153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8302198Z Autotune Choices Stats: 
2025-12-04T09:58:53.8302946Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.8303078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8303195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8303355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8303965Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8304582Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8305185Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8305792Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8306456Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8307085Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8307691Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8308296Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8308913Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8309514Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8309647Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.8309691Z Autotune Choices Stats: 2025-12-04T09:58:53.8310462Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.8310700Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8310869Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8311158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8311790Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8312416Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8313046Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8313671Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8314305Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8314941Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8315584Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8316253Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8316880Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8317525Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8317655Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.8317735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8317782Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8317826Z unimplemented [] 2025-12-04T09:58:53.8317890Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8317995Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8318571Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8318613Z graph_break [] 2025-12-04T09:58:53.8318688Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8318735Z Autotune Choices Stats: 2025-12-04T09:58:53.8319519Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.8319660Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8319779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8319943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8320564Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8321169Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8321783Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8322385Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8322992Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.8323801Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8324428Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8325035Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8325643Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8326292Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8326422Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.8326467Z Autotune Choices Stats: 2025-12-04T09:58:53.8327225Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.8327448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.8327646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8327941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8328587Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8329214Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8329836Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8330472Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8331101Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8331732Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8332367Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8333017Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8333644Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8334272Z triton_flex_attention_backward_303 0.0230 ms 67.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8334414Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.8334493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8334538Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8334582Z unimplemented [] 2025-12-04T09:58:53.8334644Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8334748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8335321Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8335364Z graph_break [] 2025-12-04T09:58:53.8335443Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8335484Z Autotune Choices Stats: 2025-12-04T09:58:53.8336295Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.8336438Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8336557Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8336718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8337347Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8337955Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8338562Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8339177Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8339782Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8340388Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8341014Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8341625Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8342227Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8342833Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8342978Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.8343023Z Autotune Choices Stats: 2025-12-04T09:58:53.8343782Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.8344002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8344169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8344446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8345095Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8345744Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8346396Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8347020Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8347660Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8348287Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8348909Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8349550Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8350206Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8350831Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8350963Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.8351044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8351102Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8351143Z unimplemented [] 2025-12-04T09:58:53.8351205Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8351310Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8351889Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8351927Z graph_break [] 2025-12-04T09:58:53.8352007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8352052Z Autotune Choices Stats: 2025-12-04T09:58:53.8352796Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.8352928Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8353044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8353234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8353842Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8354459Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.8355063Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8355671Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8356326Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8356928Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8357550Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8358168Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8358789Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8359394Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8359527Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.8359569Z Autotune Choices Stats: 2025-12-04T09:58:53.8360347Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8360571Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8360741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8361025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8361652Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8362290Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8362937Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8363559Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8364191Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8364836Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8365467Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8366157Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8366809Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8367453Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8367587Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:53.8367663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8367711Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8367751Z unimplemented [] 2025-12-04T09:58:53.8367815Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8367917Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8368492Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8368547Z graph_break [] 2025-12-04T09:58:53.8368624Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8368666Z Autotune Choices Stats: 2025-12-04T09:58:53.8369410Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.8369541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8369656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8369822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8370444Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8371071Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8371678Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8372284Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8372882Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8373502Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8374107Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8374720Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8375339Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8375997Z triton_flex_attention_432 0.0162 ms 57.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8376132Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.8376174Z Autotune Choices Stats: 2025-12-04T09:58:53.8376935Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.8377168Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8377334Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8377614Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8378252Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8378894Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8379533Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8380171Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8380796Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8381425Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8382065Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8382695Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8383325Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8383961Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8384103Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.8384180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8384230Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8384273Z unimplemented [] 2025-12-04T09:58:53.8384339Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8384441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8385016Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8385058Z graph_break [] 2025-12-04T09:58:53.8385134Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8385190Z Autotune Choices Stats: 2025-12-04T09:58:53.8385967Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.8386103Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8386218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8386386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8387002Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8387698Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8388328Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8388932Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8389536Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8390153Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8390759Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8391365Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8391985Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8392600Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8392745Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.8392788Z Autotune Choices Stats: 2025-12-04T09:58:53.8393549Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.8393771Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8393938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8394230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8394861Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8395486Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8396169Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8396808Z triton_flex_attention_backward_493 0.0190 ms 81.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8397450Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8398077Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8398701Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8399342Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8399974Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8400607Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8400751Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.8400827Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8400875Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8400914Z unimplemented [] 2025-12-04T09:58:53.8400980Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8401095Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8401675Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8401718Z graph_break [] 2025-12-04T09:58:53.8401795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8401839Z Autotune Choices Stats: 2025-12-04T09:58:53.8402576Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.8402719Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8402837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8403001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8403616Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8404219Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8404838Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8405466Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8406103Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8406707Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8407342Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8407947Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8408549Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8409160Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8409305Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.8409348Z Autotune Choices Stats: 2025-12-04T09:58:53.8410126Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.8410349Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8410516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8410796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.8411430Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8412069Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8412692Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8413327Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8413981Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8414605Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8415231Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8415871Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8416538Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8417154Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8417287Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices
2025-12-04T09:58:53.8417385Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:53.8417436Z Traceback (most recent call last):
2025-12-04T09:58:53.8417621Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:53.8417664Z self.assertTrue(
2025-12-04T09:58:53.8417773Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:53.8417825Z raise self.failureException(msg)
2025-12-04T09:58:53.8417956Z AssertionError: False is not true : Log file /tmp/tmpr1x5bd4b/flex_attention_configs.json was not created
2025-12-04T09:58:53.8417958Z
2025-12-04T09:58:53.8418035Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:53.8418218Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:53.8418220Z
2025-12-04T09:58:53.8418313Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:53.8418395Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:53.8418442Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:53.8418486Z unimplemented []
2025-12-04T09:58:53.8418549Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:53.8419130Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2),
('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.8419234Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8419275Z graph_break [] 2025-12-04T09:58:53.8419352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8419851Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.8419905Z current_size = base.storage().size() 2025-12-04T09:58:53.8419948Z Autotune Choices Stats: 2025-12-04T09:58:53.8420694Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.8420826Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8420948Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8421112Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8421731Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8422364Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8422971Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8423570Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8424172Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8424788Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8425392Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8426038Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8426649Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8427275Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8427409Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.8427457Z Autotune Choices Stats: 
2025-12-04T09:58:53.8428218Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8428448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8428617Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8428894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8429529Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8430166Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8430798Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8431431Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8432060Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8432687Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8433319Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8433951Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8434602Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8435234Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8435378Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.8435455Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8435503Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8435544Z unimplemented [] 2025-12-04T09:58:53.8435610Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8435710Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 
2), ('ok', 2)] 2025-12-04T09:58:53.8436324Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8436363Z graph_break [] 2025-12-04T09:58:53.8436442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8436497Z Autotune Choices Stats: 2025-12-04T09:58:53.8437235Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.8437366Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8437481Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8437650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8438266Z triton_flex_attention_50 0.0102 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8438885Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8439523Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8440125Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8440731Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8441345Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8441948Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8442549Z triton_flex_attention_66 
0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8443160Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8443771Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8443916Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.8443961Z Autotune Choices Stats: 2025-12-04T09:58:53.8444721Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8444941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8445111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8445404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8446070Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8446695Z 
triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8447341Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8447973Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8448609Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8449246Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8449867Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8450513Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8451133Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8451771Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8451918Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.8451994Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8452043Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8452083Z unimplemented [] 2025-12-04T09:58:53.8452150Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8452251Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8452841Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8452882Z graph_break [] 2025-12-04T09:58:53.8452962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8453004Z Autotune Choices Stats: 2025-12-04T09:58:53.8453748Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.8453889Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8454005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8454172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8454783Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8455389Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8456060Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8456688Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8457289Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8457896Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8458512Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8459115Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8459712Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8460336Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8460482Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.8460524Z Autotune Choices Stats: 2025-12-04T09:58:53.8461292Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8461515Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8461680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8461963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8462602Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8463240Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8463864Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8464504Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8465149Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8465778Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8466447Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8467103Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8467727Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8468353Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8468485Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.8468563Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8468611Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8468665Z unimplemented [] 2025-12-04T09:58:53.8468747Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8468847Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8469423Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8469485Z graph_break [] 2025-12-04T09:58:53.8469562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8469606Z Autotune Choices Stats: 2025-12-04T09:58:53.8470344Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.8470477Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8470593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8470776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8471395Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8471999Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8472604Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8480330Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8481087Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8481783Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8482430Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8483060Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8483667Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8484283Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8484422Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.8484468Z Autotune Choices Stats: 2025-12-04T09:58:53.8485250Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8485497Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8485671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8486008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8486710Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8487481Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8488141Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8488773Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8489427Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8490090Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8490718Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8491353Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.8492004Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8492628Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8492772Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.8492856Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8492909Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8492951Z unimplemented [] 2025-12-04T09:58:53.8493021Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8493127Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8493722Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8493781Z graph_break [] 2025-12-04T09:58:53.8493860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8493908Z Autotune Choices Stats: 2025-12-04T09:58:53.8494676Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.8494816Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8494938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8495103Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8495716Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8496363Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8496971Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8497576Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:53.8500018Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8500694Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8501318Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8501931Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8502554Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8503161Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8503301Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.8503343Z Autotune Choices Stats: 2025-12-04T09:58:53.8504105Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.8504339Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8504574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8504871Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8505502Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8506172Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8506820Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8507451Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8508084Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8508709Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8509387Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8510011Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8510643Z triton_flex_attention_backward_220 0.0227 ms 69.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8511284Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8511421Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.8511500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8511551Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8511594Z unimplemented [] 2025-12-04T09:58:53.8511664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8511789Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8512542Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8512663Z graph_break [] 2025-12-04T09:58:53.8512829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8512906Z Autotune Choices Stats: 2025-12-04T09:58:53.8518163Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.8518422Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8518549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8518741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8519371Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8519983Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8520641Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8521340Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8542538Z triton_flex_attention_244 0.0136 ms 71.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8543235Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8543883Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8544494Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8545099Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8545730Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8545875Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.8545956Z Autotune Choices Stats: 2025-12-04T09:58:53.8546728Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.8546965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8547137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8547442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8548104Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8548742Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8549374Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8550021Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8550655Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8551295Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8551939Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8552595Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8553228Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8553859Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8554019Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.8554102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8554161Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8554206Z unimplemented [] 2025-12-04T09:58:53.8554281Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8554389Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8554974Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8555026Z graph_break [] 2025-12-04T09:58:53.8555107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8555187Z Autotune Choices Stats: 2025-12-04T09:58:53.8555981Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.8556139Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8556276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8556454Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8557103Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8557730Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8558335Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8558969Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8559572Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8560192Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8560817Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8561449Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8562056Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8562671Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8562823Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.8562869Z Autotune Choices Stats: 2025-12-04T09:58:53.8563635Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.8563868Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8564041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8564333Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8564978Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8565635Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8566309Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8566946Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8567594Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8568229Z triton_flex_attention_backward_319 
0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8568870Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8569519Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8570184Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8570821Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8570962Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.8571042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8571099Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8571154Z unimplemented [] 2025-12-04T09:58:53.8571231Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8571338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8571914Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 
2025-12-04T09:58:53.8571954Z graph_break [] 2025-12-04T09:58:53.8572027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8572068Z Autotune Choices Stats: 2025-12-04T09:58:53.8572804Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.8572938Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8573057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8573220Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8573858Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8574467Z 
triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8575068Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8575671Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8576310Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8576906Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8577513Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8578146Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8578758Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8579358Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8579489Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.8579531Z Autotune Choices Stats: 2025-12-04T09:58:53.8580292Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", 
"best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.8580524Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8580689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8580968Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8581604Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8582241Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8582885Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8583507Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8584143Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8584781Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8585401Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8586061Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8586733Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8587362Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8587494Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.8587570Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8587614Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8587652Z unimplemented [] 2025-12-04T09:58:53.8587713Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8587814Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8588393Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8588446Z graph_break [] 2025-12-04T09:58:53.8588519Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.8588561Z Autotune Choices Stats: 2025-12-04T09:58:53.8589296Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.8589426Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8589542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8589704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8590314Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8590938Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8591548Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8592149Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8592757Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8593373Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8593975Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8594577Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8595216Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8595829Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8595997Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.8596038Z Autotune Choices Stats: 2025-12-04T09:58:53.8596793Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8597033Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8597202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8597478Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8598109Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8598735Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8599408Z triton_flex_attention_backward_400 0.0187 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8600042Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8600669Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8601299Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8601935Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8602554Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8603181Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8603836Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8603975Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:53.8604052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8604095Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8604135Z unimplemented [] 2025-12-04T09:58:53.8604197Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8604299Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8604869Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8604909Z graph_break [] 2025-12-04T09:58:53.8604983Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8605025Z Autotune Choices Stats: 
2025-12-04T09:58:53.8605766Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.8605907Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8606055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8606215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8606829Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8607432Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8608081Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8608677Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8609278Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8609881Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8610497Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8611098Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8611705Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8612326Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8612468Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.8612509Z Autotune Choices Stats: 2025-12-04T09:58:53.8613259Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.8613480Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8613645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8613929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8614558Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8615175Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8615797Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8616476Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8617116Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8617741Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8618365Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8619015Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8619638Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8620261Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8620401Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.8620477Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8620530Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8620569Z unimplemented [] 2025-12-04T09:58:53.8620630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8620732Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8621317Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8621356Z graph_break [] 2025-12-04T09:58:53.8621432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8621474Z Autotune Choices Stats: 2025-12-04T09:58:53.8622212Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.8622351Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8622466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8622626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8623239Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8623844Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8624445Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8625074Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8625677Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8626316Z 
triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8626935Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8627536Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8628140Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8628738Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8628879Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.8628919Z Autotune Choices Stats: 2025-12-04T09:58:53.8629704Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.8629922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.8630088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8630364Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8630987Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8631621Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8632247Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8632867Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8633513Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8634158Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8634778Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8635402Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8636103Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8636731Z triton_flex_attention_backward_487 0.0228 ms 67.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8636861Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.8636936Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8636980Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8637018Z unimplemented [] 2025-12-04T09:58:53.8637080Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8637200Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8637787Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8637825Z graph_break [] 2025-12-04T09:58:53.8637914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8637955Z Autotune Choices Stats: 2025-12-04T09:58:53.8638699Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.8638829Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8638942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8639103Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8639722Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8640321Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8640927Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8641525Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8642160Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8642768Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8643377Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8643987Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8644586Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8645193Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8645324Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.8645366Z Autotune Choices Stats: 2025-12-04T09:58:53.8646170Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.8646411Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8646576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8646854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8647483Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8648106Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8648750Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8649378Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8650009Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8650670Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8651290Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8651922Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8652555Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8653178Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8653310Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:53.8653385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8653430Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8653469Z unimplemented [] 2025-12-04T09:58:53.8653530Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8653629Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8654202Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8654250Z graph_break [] 2025-12-04T09:58:53.8654325Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8654365Z Autotune Choices Stats: 2025-12-04T09:58:53.8655119Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:53.8655248Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8655362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8655523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8656174Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8656792Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.8657391Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8657996Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8658595Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8659239Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8659838Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8660444Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8661052Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8661653Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8661784Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:53.8661824Z Autotune Choices Stats: 2025-12-04T09:58:53.8662590Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8662818Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8662994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8663281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8663907Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8664536Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8665164Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8665785Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8666443Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8667070Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8667763Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8668393Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8669024Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8669661Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8669794Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:53.8669889Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.8669941Z Traceback (most recent call last): 2025-12-04T09:58:53.8670094Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.8670139Z self.assertTrue( 2025-12-04T09:58:53.8670245Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.8670298Z raise self.failureException(msg) 2025-12-04T09:58:53.8670426Z AssertionError: False is not true : Log file /tmp/tmph000f7qb/flex_attention_configs.json was not created 2025-12-04T09:58:53.8670433Z 2025-12-04T09:58:53.8670509Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.8670677Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.8670679Z 2025-12-04T09:58:53.8670772Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.8670852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8670898Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8670953Z unimplemented [] 2025-12-04T09:58:53.8671015Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8671613Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), 
('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.8671724Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8671766Z graph_break [] 2025-12-04T09:58:53.8671843Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8672336Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.8672390Z current_size = base.storage().size() 2025-12-04T09:58:53.8672433Z Autotune Choices Stats: 2025-12-04T09:58:53.8673177Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.8673318Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8673439Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8673600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8674214Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8674821Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8675421Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8676092Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8676692Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8677290Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8677902Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8678501Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8679099Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8679699Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8679843Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.8679888Z Autotune Choices Stats: 2025-12-04T09:58:53.8680665Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8680886Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8681054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8681331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8681959Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8682598Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8683219Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8683842Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8684492Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8685128Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8685750Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8686416Z triton_flex_attention_backward_45 0.0221 ms 70.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8687061Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8687682Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8687811Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.8687888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8687933Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:53.8687976Z unimplemented [] 2025-12-04T09:58:53.8688039Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8688159Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8688752Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8688793Z graph_break [] 2025-12-04T09:58:53.8688883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8688925Z Autotune Choices Stats: 2025-12-04T09:58:53.8689666Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.8689794Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8689911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8690076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8690697Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8691303Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8691905Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8692503Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8693144Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8693747Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8694348Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8694959Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8695567Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8696216Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8696346Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 
0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.8696389Z Autotune Choices Stats: 2025-12-04T09:58:53.8697163Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8697414Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8697583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8697862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8698492Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8699114Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8699752Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8700371Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8700997Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8701659Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8702279Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8702905Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8703537Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8704165Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8704296Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.8704370Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8704417Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8704456Z unimplemented [] 2025-12-04T09:58:53.8704521Z stats [('calls_captured', 105), 
('unique_graphs', 2)] 2025-12-04T09:58:53.8704620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8705196Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8705249Z graph_break [] 2025-12-04T09:58:53.8705326Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8705367Z Autotune Choices Stats: 2025-12-04T09:58:53.8706285Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.8706417Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8706532Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8706697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8707306Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8707916Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8708520Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8709121Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8709722Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8710364Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8710970Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8711569Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8712178Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8712784Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8712918Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.8712959Z Autotune Choices Stats: 
2025-12-04T09:58:53.8713712Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8713939Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8714122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8714410Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8715040Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8715666Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8716340Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8716962Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8717596Z triton_flex_attention_backward_134 0.0202 ms 77.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8718216Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8718877Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8719510Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8720133Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8720769Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8720898Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.8720973Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8721017Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8721055Z unimplemented [] 2025-12-04T09:58:53.8721120Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8721220Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8721792Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8721829Z graph_break [] 2025-12-04T09:58:53.8721906Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8721946Z Autotune Choices Stats: 2025-12-04T09:58:53.8722693Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.8722834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8722957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8723118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.8723731Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8724329Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8724941Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8725539Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8726201Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8726800Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8727447Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.8728048Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8728650Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8729265Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8729397Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.8729438Z Autotune Choices Stats: 2025-12-04T09:58:53.8730192Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8730413Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8730579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8730874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8731522Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8732144Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8732771Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8733404Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8734028Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8734661Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8735281Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8735985Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8736609Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8737235Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8737381Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.8737456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8737503Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8737544Z unimplemented [] 2025-12-04T09:58:53.8737608Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8737707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8738283Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8738325Z graph_break [] 2025-12-04T09:58:53.8738399Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8738444Z Autotune Choices Stats: 2025-12-04T09:58:53.8739186Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.8739330Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8739445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8739622Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8740241Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8740842Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8741448Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8742060Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8742658Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8743261Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8743878Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8744501Z triton_flex_attention_204 0.0140 ms 66.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8745103Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8745709Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8745858Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.8745900Z Autotune Choices Stats: 2025-12-04T09:58:53.8746686Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.8746906Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8747072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8747354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8747988Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8748642Z 
triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8749268Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8749894Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8750533Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8751155Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8751784Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8752419Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8753066Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8753686Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8753817Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.8753895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8753939Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8753980Z unimplemented [] 2025-12-04T09:58:53.8754053Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8754157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8754732Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8754772Z graph_break [] 2025-12-04T09:58:53.8754849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8754892Z Autotune Choices Stats: 2025-12-04T09:58:53.8755626Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.8755757Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8755876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8756064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8756701Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8757312Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8757916Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8758522Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8759139Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8759739Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8760343Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8760961Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8761593Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8762191Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8762325Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.8762370Z Autotune Choices Stats: 2025-12-04T09:58:53.8763129Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.8763358Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8763523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8763807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8764439Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8765080Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8765726Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8766373Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8766997Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8767659Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8768280Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8768908Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8769555Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8770203Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8770335Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.8770411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8770456Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8770498Z unimplemented [] 2025-12-04T09:58:53.8770561Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8770665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8771238Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8771291Z graph_break [] 2025-12-04T09:58:53.8771366Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8771409Z Autotune Choices Stats: 2025-12-04T09:58:53.8772153Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.8772282Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8772398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8772559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8773168Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8773791Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8774408Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8775010Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8775611Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8776272Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8776873Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8777476Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8778116Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8778740Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8778872Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.8778916Z Autotune Choices Stats: 2025-12-04T09:58:53.8779672Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.8779904Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8780071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8780349Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8780982Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8781603Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8782228Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8782879Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8783506Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8784132Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8784767Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8785395Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.8786081Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8786731Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8786860Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.8786952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8786996Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8787037Z unimplemented [] 2025-12-04T09:58:53.8787098Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8787201Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8787773Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8787814Z graph_break [] 2025-12-04T09:58:53.8787891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8787935Z Autotune Choices Stats: 2025-12-04T09:58:53.8788673Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.8788814Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8788931Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8789093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8789709Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8790311Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8790931Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8791550Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.8792151Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8792751Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8793363Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8793971Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8794576Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8795203Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8795332Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.8795387Z Autotune Choices Stats: 2025-12-04T09:58:53.8796168Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.8796387Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8796556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8796851Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8797489Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8798117Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8798747Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8799391Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8800033Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8800668Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8801290Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8801925Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8802555Z triton_flex_attention_backward_358 0.0228 ms 68.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8803179Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8803320Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.8803399Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8803443Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8803503Z unimplemented [] 2025-12-04T09:58:53.8803565Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8803669Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8804252Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8804292Z graph_break [] 2025-12-04T09:58:53.8804370Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8804412Z Autotune Choices Stats: 2025-12-04T09:58:53.8805160Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.8805300Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8805416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8805582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8806224Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8806831Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8807440Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8808068Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8808684Z triton_flex_attention_390 0.0132 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8809295Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8809897Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8810513Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8811122Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8811724Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8811865Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.8811910Z Autotune Choices Stats: 2025-12-04T09:58:53.8812692Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8812909Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8813079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8813363Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8813991Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8814627Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8815248Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8815872Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8816571Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8817208Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8817844Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8818464Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8819104Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8819729Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8819860Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:53.8819936Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8819983Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8820021Z unimplemented [] 2025-12-04T09:58:53.8820084Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8820200Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8820791Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8820830Z graph_break [] 2025-12-04T09:58:53.8820919Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8820961Z Autotune Choices Stats: 2025-12-04T09:58:53.8821699Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.8821830Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8821946Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8822113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8822735Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8823339Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8823943Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8824549Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8825187Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8825792Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8826435Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8827033Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8827650Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8828254Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8828385Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.8828427Z Autotune Choices Stats: 2025-12-04T09:58:53.8829202Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.8829433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8829619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8829905Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8830534Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8831160Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8831795Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8832419Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8833046Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8833693Z 
triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8834326Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8834960Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8835585Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8836264Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8836396Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.8836472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8836519Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8836559Z unimplemented [] 2025-12-04T09:58:53.8836624Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8836724Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8837304Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8837360Z graph_break [] 2025-12-04T09:58:53.8837437Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8837480Z Autotune Choices Stats: 2025-12-04T09:58:53.8838263Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.8838394Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8838509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8838677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8839291Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8839914Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8840517Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8841118Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8841719Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8842356Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8842961Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8843564Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8844179Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8844781Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8844914Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.8844956Z Autotune Choices Stats: 2025-12-04T09:58:53.8845718Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.8845977Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8846181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8846462Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8847102Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8847728Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8848353Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8848991Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8849615Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8850239Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8850895Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8851522Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8852145Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8852782Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8852913Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.8852989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8853035Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8853074Z unimplemented [] 2025-12-04T09:58:53.8853138Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8853240Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8853815Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8853858Z graph_break [] 
2025-12-04T09:58:53.8853934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8853978Z Autotune Choices Stats: 2025-12-04T09:58:53.8854730Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.8854870Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8854994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8855158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8855776Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8856403Z triton_flex_attention_510 0.0097 ms 97.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8857027Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8857632Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8858232Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8858834Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8859499Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8860101Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8860705Z 
triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8861326Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8861458Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.8861500Z Autotune Choices Stats: 2025-12-04T09:58:53.8862259Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.8862479Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8862645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8862934Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8863598Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8864219Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.8864846Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8865484Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8866159Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8866783Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8867416Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8868083Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8868710Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8869340Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8869485Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:53.8869563Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8869607Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8869647Z unimplemented [] 2025-12-04T09:58:53.8869709Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8869811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8870387Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8870430Z graph_break [] 2025-12-04T09:58:53.8870504Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.8870550Z Autotune Choices Stats: 2025-12-04T09:58:53.8871292Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:53.8871435Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8871552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8871727Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8872347Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8872962Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8873559Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8874179Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8874780Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8875384Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8876036Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8876660Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8877266Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8877866Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8878011Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:53.8878056Z Autotune Choices Stats: 2025-12-04T09:58:53.8878810Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8879033Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8879211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8879487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8880116Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8880771Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8881392Z triton_flex_attention_backward_584 0.0186 ms 
84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8882022Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8882659Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8883292Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8883914Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8884540Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8885204Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8885829Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8885997Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:53.8886076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8886120Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8886162Z unimplemented [] 2025-12-04T09:58:53.8886223Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8886348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8886916Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8886958Z graph_break [] 2025-12-04T09:58:53.8887036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8887080Z Autotune Choices Stats: 
2025-12-04T09:58:53.8887823Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:53.8887951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8888068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8888229Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8888855Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8889490Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8890092Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8890692Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8891313Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8891913Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8892513Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8893130Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8893752Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8894355Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8894483Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:53.8894530Z Autotune Choices Stats: 2025-12-04T09:58:53.8895289Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:53.8895525Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8895696Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8896007Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8896640Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8897268Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8897929Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8898552Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8899189Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8899838Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8900457Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8901088Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8901734Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8902378Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8902508Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:53.8902605Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.8902654Z Traceback (most recent call last): 2025-12-04T09:58:53.8902809Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.8902854Z self.assertTrue( 2025-12-04T09:58:53.8902962Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.8903012Z raise self.failureException(msg) 2025-12-04T09:58:53.8903143Z AssertionError: False is not true : Log file /tmp/tmpthqoqawy/flex_attention_configs.json was not created 2025-12-04T09:58:53.8903145Z 2025-12-04T09:58:53.8903221Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.8903400Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.8903403Z 2025-12-04T09:58:53.8903496Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.8903572Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8903620Z frames [('total', 2), ('ok', 2)] 
2025-12-04T09:58:53.8903660Z unimplemented [] 2025-12-04T09:58:53.8903724Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8904303Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.8904407Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8904445Z graph_break [] 2025-12-04T09:58:53.8904523Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8905014Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.8905068Z current_size = base.storage().size() 2025-12-04T09:58:53.8905110Z Autotune Choices Stats: 2025-12-04T09:58:53.8905864Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.8906043Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8906180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8906344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8906960Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8907569Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8908195Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8908797Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8909400Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8909998Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8910654Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8911260Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8911862Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8912476Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8912610Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.8912653Z Autotune Choices Stats: 2025-12-04T09:58:53.8913408Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.8913630Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8913796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8914089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8914739Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8915368Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8916023Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8916665Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8917290Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8917916Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8918538Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8919210Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8919833Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8920457Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8920599Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.8920674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8920723Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8920763Z unimplemented [] 2025-12-04T09:58:53.8920828Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8920928Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8921511Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8921555Z graph_break [] 2025-12-04T09:58:53.8921630Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.8921673Z Autotune Choices Stats: 2025-12-04T09:58:53.8922409Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.8922560Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8922673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8922847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8923467Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8924065Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8924676Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8925292Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8925895Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8926525Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8927133Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8927779Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8928378Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8928982Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8929129Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.8929172Z Autotune Choices Stats: 2025-12-04T09:58:53.8929924Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.8930146Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8930313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8930597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8931229Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8931879Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8932499Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8933123Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8933762Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8934382Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8935012Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8935638Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8936333Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8936951Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8937083Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.8937163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8937207Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8937249Z unimplemented [] 2025-12-04T09:58:53.8937311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8937437Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8938011Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.8938052Z graph_break [] 2025-12-04T09:58:53.8938130Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8938175Z Autotune Choices Stats: 2025-12-04T09:58:53.8938910Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.8939042Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8939159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8939323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8939951Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8940576Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8941180Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8941779Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8942391Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.8942990Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8943595Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8944211Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8944831Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8945438Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8945570Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.8945614Z Autotune Choices Stats: 2025-12-04T09:58:53.8946420Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.8946662Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.8946830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8947111Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8947755Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8948377Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8949043Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8949667Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8950294Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8950936Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8951556Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8952186Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8952828Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8953479Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8953609Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.8953687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8953731Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8953772Z unimplemented [] 2025-12-04T09:58:53.8953834Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8953937Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8954509Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8954564Z graph_break [] 2025-12-04T09:58:53.8954639Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8954685Z Autotune Choices Stats: 2025-12-04T09:58:53.8955433Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.8955562Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8955681Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8955842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8956502Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8957136Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8957753Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8958357Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8958968Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8959583Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8960188Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8960790Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8961401Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8962025Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8962157Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.8962204Z Autotune Choices Stats: 2025-12-04T09:58:53.8962959Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.8963192Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8963362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8963639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8964275Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8964899Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8965530Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8966219Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8966851Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8967476Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8968113Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8968741Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8969368Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8970006Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8970161Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.8970251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8970295Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8970337Z unimplemented [] 2025-12-04T09:58:53.8970400Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8970503Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8971076Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8971119Z graph_break [] 2025-12-04T09:58:53.8971196Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8971240Z Autotune Choices Stats: 2025-12-04T09:58:53.8971983Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.8972123Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8972240Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8972401Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8973014Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8973617Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.8974244Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8974856Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8975457Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8976092Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8976711Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8977310Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8977917Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8978530Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8978674Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.8978718Z Autotune Choices Stats: 2025-12-04T09:58:53.8979490Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.8979708Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8979883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8980171Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8980801Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8981425Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8982054Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8982690Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8983334Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8983964Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8984589Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8985220Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8985851Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8986508Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8986655Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.8986734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.8986778Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.8986819Z unimplemented [] 2025-12-04T09:58:53.8986893Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.8986997Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.8987585Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.8987625Z graph_break [] 2025-12-04T09:58:53.8987703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.8987744Z Autotune Choices Stats: 2025-12-04T09:58:53.8988483Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.8988622Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8988740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8988905Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8989516Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8990122Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8990731Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8991358Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8991974Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8992578Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8993177Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.8993796Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8994399Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8995003Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8995145Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.8995186Z Autotune Choices Stats: 2025-12-04T09:58:53.8996024Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.8996240Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.8996409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.8996689Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.8997315Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8997956Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8998578Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.8999201Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.8999855Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9000491Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9001115Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9001738Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9002386Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9003010Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9003142Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.9003218Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9003265Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9003304Z unimplemented [] 2025-12-04T09:58:53.9003368Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9003468Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9004065Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9004104Z graph_break [] 2025-12-04T09:58:53.9004182Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9004235Z Autotune Choices Stats: 2025-12-04T09:58:53.9004969Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.9005102Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9005217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9005383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9006037Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9006662Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9007267Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9007872Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9008498Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9009111Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9009721Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9010325Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9010936Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9011548Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9011682Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.9011723Z Autotune Choices Stats: 2025-12-04T09:58:53.9016111Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.9016356Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9016543Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9016821Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9017459Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9018078Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9018714Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9019345Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9019972Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9020617Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9021257Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9021887Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9022510Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9023143Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9023276Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.9023353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9023398Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9023437Z unimplemented [] 2025-12-04T09:58:53.9023502Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9023604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9024179Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9024227Z graph_break [] 2025-12-04T09:58:53.9024305Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9024345Z Autotune Choices Stats: 2025-12-04T09:58:53.9025101Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.9025233Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9025350Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9025513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9026170Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9026789Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9027396Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9028001Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9028598Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9029238Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9029839Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9030443Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9031045Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9031659Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9031792Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.9031833Z Autotune Choices Stats: 2025-12-04T09:58:53.9032587Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.9032808Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9032999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9033277Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.9033916Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9034541Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9035167Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9035800Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9036473Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9037099Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9037756Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9038398Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9039027Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9039648Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9039793Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.9039868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9039913Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9039951Z unimplemented [] 2025-12-04T09:58:53.9040013Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9040114Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9040690Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9040728Z graph_break [] 2025-12-04T09:58:53.9040802Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9040844Z Autotune Choices Stats: 2025-12-04T09:58:53.9041591Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:53.9041731Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9041858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9042022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9042634Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9043234Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9043848Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9044448Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9045056Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9045650Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9046351Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9046954Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9047556Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9048168Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9048299Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.9048339Z Autotune Choices Stats: 2025-12-04T09:58:53.9049097Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9049316Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9049482Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9049762Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9050427Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9051050Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9051675Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9052299Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9052937Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9053561Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9054185Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9054853Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9055480Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9056137Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9056285Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:53.9056360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9056402Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9056441Z unimplemented [] 2025-12-04T09:58:53.9056503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9056604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9057182Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9057223Z graph_break [] 2025-12-04T09:58:53.9057296Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9057339Z Autotune Choices Stats: 2025-12-04T09:58:53.9058083Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.9058213Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:53.9058348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9058527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9059147Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9059747Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9060348Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9060961Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9061560Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9062169Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9062778Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9063409Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9064010Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9064612Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9064753Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.9064795Z Autotune Choices Stats: 2025-12-04T09:58:53.9065558Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.9065782Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9065975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9066251Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9066887Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9067545Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9068159Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9068790Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9069435Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9070057Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9070681Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9071307Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9071958Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9072584Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9072713Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.9072789Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:53.9072833Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9072872Z unimplemented [] 2025-12-04T09:58:53.9072933Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9073035Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9073617Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9073655Z graph_break [] 2025-12-04T09:58:53.9073728Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9073769Z Autotune Choices Stats: 2025-12-04T09:58:53.9074517Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.9074645Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9074758Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9074920Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9075539Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9076196Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9076799Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.9077399Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9078009Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9078611Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9079216Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9079818Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9080471Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9081073Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9081202Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.9081244Z Autotune Choices Stats: 2025-12-04T09:58:53.9082009Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.9082248Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9082415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9082690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9083318Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9083938Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9084597Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9085219Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9085846Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9086524Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9087143Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9087771Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9088400Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9089065Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9089193Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.9089270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9089312Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:53.9089352Z unimplemented [] 2025-12-04T09:58:53.9089413Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9089514Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9090087Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9090140Z graph_break [] 2025-12-04T09:58:53.9090216Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9090259Z Autotune Choices Stats: 2025-12-04T09:58:53.9090996Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.9091123Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9091238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9091397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9092009Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9092624Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9093244Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9093852Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9094456Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9095072Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9095672Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9096315Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9096928Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9097549Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9097679Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.9097720Z Autotune Choices Stats: 2025-12-04T09:58:53.9098478Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.9098694Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9098874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9099150Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9099776Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9100406Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9101026Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9101685Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9102315Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9102941Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9103573Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9104195Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9104824Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9105454Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9105595Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:53.9105669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9105721Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9105760Z unimplemented [] 
2025-12-04T09:58:53.9105820Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9105966Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9106548Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9106586Z graph_break [] 2025-12-04T09:58:53.9106661Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9106701Z Autotune Choices Stats: 2025-12-04T09:58:53.9107439Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:53.9107587Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9107702Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9107864Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9108477Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9109083Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9109715Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9110330Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9110930Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9111531Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9112149Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9112752Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9113354Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9113970Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9114115Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:53.9114162Z Autotune Choices Stats: 2025-12-04T09:58:53.9114926Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9115149Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9115321Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9115600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9116282Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9116906Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9117529Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9118184Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9118838Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9119466Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9120092Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9120744Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9121369Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9122005Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9122136Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:53.9122226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9122270Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9122308Z unimplemented [] 2025-12-04T09:58:53.9122371Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:53.9122481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9123064Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9123104Z graph_break [] 2025-12-04T09:58:53.9123179Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9123220Z Autotune Choices Stats: 2025-12-04T09:58:53.9123967Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:53.9124096Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9124226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9124387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9124997Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9125730Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9126369Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9127006Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9127619Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9128231Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9128829Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9129446Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9130043Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9130647Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9130789Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:53.9130831Z Autotune Choices Stats: 
2025-12-04T09:58:53.9131606Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:53.9131823Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9131995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9132276Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9132906Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9133548Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9134182Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9134808Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9135444Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9136131Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9136756Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9137386Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9138036Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9138660Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9138791Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:53.9138867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9138912Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9138950Z unimplemented [] 2025-12-04T09:58:53.9139013Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9139112Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9139714Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9139752Z graph_break [] 2025-12-04T09:58:53.9139829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9139869Z Autotune Choices Stats: 2025-12-04T09:58:53.9140629Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:53.9140760Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9140874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9141037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.9141649Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9142270Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9142875Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9143481Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9144111Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9144721Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9145324Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9145974Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9146589Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9147190Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9147326Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:53.9147367Z Autotune Choices Stats: 2025-12-04T09:58:53.9148126Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:53.9148376Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9148559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9148839Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9149468Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9150100Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9150736Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9151358Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9151983Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9152630Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9153258Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9153885Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9154513Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9155144Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9155275Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:53.9155370Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.9155421Z Traceback (most recent call last): 2025-12-04T09:58:53.9155574Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.9155620Z self.assertTrue( 2025-12-04T09:58:53.9155724Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 
2025-12-04T09:58:53.9155777Z raise self.failureException(msg) 2025-12-04T09:58:53.9155905Z AssertionError: False is not true : Log file /tmp/tmp6db4qp8j/flex_attention_configs.json was not created 2025-12-04T09:58:53.9155911Z 2025-12-04T09:58:53.9156026Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.9156194Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.9156213Z 2025-12-04T09:58:53.9156304Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.9156381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9156427Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9156468Z unimplemented [] 2025-12-04T09:58:53.9156545Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9157131Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.9157233Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9157274Z graph_break [] 2025-12-04T09:58:53.9157349Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9157846Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. 
This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.9157895Z current_size = base.storage().size() 2025-12-04T09:58:53.9157939Z Autotune Choices Stats: 2025-12-04T09:58:53.9158686Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.9158830Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9158946Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9159108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9159722Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9160329Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9160945Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9161567Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9162167Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9162771Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9163389Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9163992Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9164595Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9165203Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9165351Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.9165395Z Autotune Choices Stats: 2025-12-04T09:58:53.9166205Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.9166427Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9166595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9166871Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9167517Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9168141Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9168760Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9169398Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9170049Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9170673Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9171291Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9171933Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9172556Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9173178Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9173309Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.9173400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9173443Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9173484Z unimplemented [] 2025-12-04T09:58:53.9173546Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9173658Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9174242Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9174281Z graph_break [] 
2025-12-04T09:58:53.9174356Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9174396Z Autotune Choices Stats: 2025-12-04T09:58:53.9175132Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.9175260Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9175391Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9175552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9176195Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9176796Z triton_flex_attention_53 0.0106 ms 95.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9177399Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9178033Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9178647Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9179247Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9179851Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9180464Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9181063Z triton_flex_attention_58 
0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9181664Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9181794Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.9181867Z Autotune Choices Stats: 2025-12-04T09:58:53.9182643Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 
2025-12-04T09:58:53.9182860Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9183028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9183306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9183940Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9184573Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9185196Z 
triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9185827Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9186503Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9187151Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9187781Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9188403Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9189040Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9189672Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9189804Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.9189878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9189924Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9189963Z unimplemented [] 2025-12-04T09:58:53.9190026Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9190126Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9190722Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9190761Z graph_break [] 2025-12-04T09:58:53.9190835Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:53.9190876Z Autotune Choices Stats: 2025-12-04T09:58:53.9191633Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.9191768Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9191884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9192047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9192653Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9193267Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9193870Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9194465Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9195097Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9195708Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9196349Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9196948Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9197584Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9198184Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9198317Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.9198358Z Autotune Choices Stats: 2025-12-04T09:58:53.9199115Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.9199361Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9199542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9199820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9200450Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9201077Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9201713Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9202337Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9202962Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9203613Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9204245Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9204871Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9205503Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9206172Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9206302Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.9206378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9206424Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9206462Z unimplemented [] 2025-12-04T09:58:53.9206526Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9206625Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9207200Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9207254Z graph_break [] 2025-12-04T09:58:53.9207330Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9207371Z Autotune Choices Stats: 2025-12-04T09:58:53.9208132Z {"num_choices": 24, "num_triton_choices": 
24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.9208262Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9208377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9208537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9209149Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9209754Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9210370Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9210972Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9211581Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9212200Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9212817Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9213430Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9214026Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9214641Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9214772Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.9214814Z Autotune Choices Stats: 2025-12-04T09:58:53.9215575Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9215794Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.9216004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9216297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9216945Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9217570Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9218193Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9218830Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9219456Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9220081Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9220731Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9221366Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9221993Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9222616Z triton_flex_attention_backward_165 0.0232 ms 68.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9222762Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.9222836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9222882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9222921Z unimplemented [] 2025-12-04T09:58:53.9222984Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9223085Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9223659Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9223699Z graph_break [] 2025-12-04T09:58:53.9223772Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9223814Z Autotune Choices Stats: 2025-12-04T09:58:53.9224558Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.9224709Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9224823Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9224995Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9225605Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9226244Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9226868Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9227470Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9228072Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9228678Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9229329Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9229930Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9230535Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9231136Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9231280Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.9231321Z Autotune Choices Stats: 2025-12-04T09:58:53.9232076Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.9232297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9232462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9232740Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9233388Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9234018Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9234641Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9235264Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9235902Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9236566Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9237190Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9237855Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9238484Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9239107Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9239238Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.9239329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9239375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9239415Z unimplemented [] 2025-12-04T09:58:53.9239477Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9239576Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9240148Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9240189Z graph_break [] 2025-12-04T09:58:53.9240266Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9240312Z Autotune Choices Stats: 2025-12-04T09:58:53.9241051Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.9241180Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9241307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9241468Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9242105Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9242708Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9243312Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9243926Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9244537Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9245138Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9245738Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9246421Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9247025Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9247625Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9247768Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.9247813Z Autotune Choices Stats: 2025-12-04T09:58:53.9248581Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.9248800Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9248966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9249243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9249874Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9250534Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9251155Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9251783Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9252415Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9253051Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9253673Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9254307Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9254964Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9255585Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9255716Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.9255794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9255838Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9255879Z unimplemented [] 2025-12-04T09:58:53.9255979Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9256080Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9256674Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9256714Z graph_break [] 2025-12-04T09:58:53.9256788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9256831Z Autotune Choices Stats: 2025-12-04T09:58:53.9257564Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.9257694Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9257811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9257973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9258582Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9259219Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9259817Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9260422Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9261045Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9261648Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9262255Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9262857Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9263491Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9264097Z triton_flex_attention_294 0.0166 ms 70.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9264228Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.9264271Z Autotune Choices Stats: 2025-12-04T09:58:53.9265025Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.9265257Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9265423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9265699Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9266363Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9267067Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9267730Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9268361Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9268994Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9269633Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9270258Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9270888Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9271515Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9272166Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9272295Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.9272374Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9272418Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9272461Z unimplemented [] 2025-12-04T09:58:53.9272526Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9272633Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9273205Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9273258Z graph_break [] 2025-12-04T09:58:53.9273337Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9273376Z Autotune Choices Stats: 2025-12-04T09:58:53.9274125Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.9274253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9274371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9274531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9275144Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9275756Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9276426Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9277028Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9277631Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9278251Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9278850Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9279457Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9280058Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9280698Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9280828Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.9280873Z Autotune Choices Stats: 2025-12-04T09:58:53.9281633Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.9281848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9282027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9282307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9282942Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9283568Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9284193Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9284847Z triton_flex_attention_backward_355 0.0187 ms 83.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9285475Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9286138Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9286784Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9287409Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9288034Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9288683Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9288824Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.9288903Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9288946Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9289000Z unimplemented [] 2025-12-04T09:58:53.9289060Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9289163Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9289736Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9289775Z graph_break [] 2025-12-04T09:58:53.9289850Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9289891Z Autotune Choices Stats: 2025-12-04T09:58:53.9290635Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.9290775Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9290892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9291056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9291666Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9292275Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9292895Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9293523Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9294128Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9294737Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9295352Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9295986Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9296710Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9297324Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9297467Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.9297511Z Autotune Choices Stats: 2025-12-04T09:58:53.9298284Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9298503Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9298676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9298955Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.9299606Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9300226Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9300858Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9301495Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9302142Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9302772Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9303398Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9304028Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9304654Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9305284Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9305413Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:53.9305499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9305544Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9305581Z unimplemented [] 2025-12-04T09:58:53.9305643Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9305752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9306375Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9306414Z graph_break [] 2025-12-04T09:58:53.9306490Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9306530Z Autotune Choices Stats: 2025-12-04T09:58:53.9307268Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 
2025-12-04T09:58:53.9307396Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9307523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9307687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9308304Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9308913Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9309516Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9310154Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9310762Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9311365Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9311966Z triton_flex_attention_428 0.0136 ms 68.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9312582Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9313181Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9313782Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9313914Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.9313967Z Autotune Choices Stats: 2025-12-04T09:58:53.9314752Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.9314968Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9315136Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9315412Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9316077Z triton_flex_attention_backward_455 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9316720Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9317342Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9317973Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9318608Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9319260Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9319884Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9320510Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9321144Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9321770Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9321905Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 
choices 2025-12-04T09:58:53.9321979Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9322024Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9322062Z unimplemented [] 2025-12-04T09:58:53.9322124Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9322224Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9322813Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9322860Z graph_break [] 2025-12-04T09:58:53.9322935Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9322974Z Autotune Choices Stats: 2025-12-04T09:58:53.9323744Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.9323875Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:53.9323994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9324158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9324766Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9325388Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9326023Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9326623Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9327248Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9327866Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9328481Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9329081Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9329720Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9330319Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9330449Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.9330490Z Autotune Choices Stats: 2025-12-04T09:58:53.9331245Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.9331489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9331666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9331948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9332577Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9333207Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9333845Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9334471Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9335098Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9335743Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9336435Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9337064Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9337692Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9338339Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9338469Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.9338546Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:53.9338590Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9338627Z unimplemented [] 2025-12-04T09:58:53.9338690Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9338791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9339371Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9339412Z graph_break [] 2025-12-04T09:58:53.9339509Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9339552Z Autotune Choices Stats: 2025-12-04T09:58:53.9340338Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.9340466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9340582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9340744Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9341360Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9341963Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9342578Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.9343183Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9343784Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9344419Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9345035Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9345644Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9346270Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9346891Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9347023Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.9347065Z Autotune Choices Stats: 2025-12-04T09:58:53.9347827Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.9348045Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9348229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9348521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9349169Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9349794Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9350418Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9351052Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9351682Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9352309Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9352954Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9353597Z triton_flex_attention_backward_551 0.0221 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9354223Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9354844Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9354987Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:53.9355065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9355109Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:53.9355150Z unimplemented [] 2025-12-04T09:58:53.9355211Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9355313Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9355888Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9355960Z graph_break [] 2025-12-04T09:58:53.9356041Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9356085Z Autotune Choices Stats: 2025-12-04T09:58:53.9356820Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:53.9356994Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9357113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9357293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9357901Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9358507Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9359108Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9359726Z triton_flex_attention_558 0.0120 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9360337Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9360940Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9361569Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9362184Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9362789Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9363394Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9363536Z SingleProcess AUTOTUNE 
benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:53.9363579Z Autotune Choices Stats: 2025-12-04T09:58:53.9364339Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9364559Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9364727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9365008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9365668Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9366351Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9366973Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9367610Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9368264Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9368899Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9369521Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9370194Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9370828Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9371458Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9371586Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:53.9371675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9371718Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9371757Z unimplemented [] 
2025-12-04T09:58:53.9371819Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9371923Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9372498Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9372540Z graph_break [] 2025-12-04T09:58:53.9372614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9372657Z Autotune Choices Stats: 2025-12-04T09:58:53.9373417Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:53.9373547Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9373664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9373838Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9374471Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9375079Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9375682Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9376331Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9376942Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9377548Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9378149Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9378802Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9379409Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9380018Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9380147Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds 
precompiling for 24 choices 2025-12-04T09:58:53.9380205Z Autotune Choices Stats: 2025-12-04T09:58:53.9380963Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:53.9381184Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9381354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9381632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9382267Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9382911Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9383548Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9384182Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9384815Z 
triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9385454Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9386102Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9386732Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9387402Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9388034Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9388164Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:53.9388242Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9388285Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9388326Z unimplemented [] 2025-12-04T09:58:53.9388387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:53.9388487Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9389063Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9389122Z graph_break [] 2025-12-04T09:58:53.9389198Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9389239Z Autotune Choices Stats: 2025-12-04T09:58:53.9389975Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:53.9390105Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9390219Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9390379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9390988Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9391620Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9392221Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9392824Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9393440Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9394039Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9394652Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9395258Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9395896Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9396536Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9396668Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:53.9396712Z Autotune Choices Stats: 
2025-12-04T09:58:53.9397473Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:53.9397709Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9397879Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9398156Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9398797Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9399423Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9400080Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9400710Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9401339Z triton_flex_attention_backward_687 0.0201 ms 77.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9401977Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9402606Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9403234Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9403863Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9404524Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9404657Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:53.9404735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9404779Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9404820Z unimplemented [] 2025-12-04T09:58:53.9404882Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9404985Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9405563Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9405601Z graph_break [] 2025-12-04T09:58:53.9405688Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9405733Z Autotune Choices Stats: 2025-12-04T09:58:53.9406518Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:53.9406646Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9406761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9406923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.9407540Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9408162Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9408792Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9409392Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9409996Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9410619Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9411233Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9411842Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9412442Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9413074Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9413206Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:53.9413249Z Autotune Choices Stats: 2025-12-04T09:58:53.9414012Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9414231Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9414410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9414690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9415320Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9415977Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9416604Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9417264Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9417897Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9418528Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9419166Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9419794Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9420430Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9421056Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9421215Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:53.9421310Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.9421363Z Traceback (most recent call last): 2025-12-04T09:58:53.9421527Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.9421572Z self.assertTrue( 2025-12-04T09:58:53.9421678Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 
2025-12-04T09:58:53.9421734Z raise self.failureException(msg) 2025-12-04T09:58:53.9421863Z AssertionError: False is not true : Log file /tmp/tmpox4mtzl8/flex_attention_configs.json was not created 2025-12-04T09:58:53.9421866Z 2025-12-04T09:58:53.9421944Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:53.9422109Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:53.9422112Z 2025-12-04T09:58:53.9422206Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:53.9422282Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9422331Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9422371Z unimplemented [] 2025-12-04T09:58:53.9422439Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9423023Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:53.9423137Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9423180Z graph_break [] 2025-12-04T09:58:53.9423256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9423748Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. 
This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.9423797Z current_size = base.storage().size() 2025-12-04T09:58:53.9423840Z Autotune Choices Stats: 2025-12-04T09:58:53.9424580Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.9424708Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9424827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9425000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9425640Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9426280Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9426893Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9427515Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9428119Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9428719Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9429327Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9429966Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9430568Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9431174Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9431306Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.9431365Z Autotune Choices Stats: 2025-12-04T09:58:53.9432122Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.9432343Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9432511Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9432798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9433423Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9434064Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9434698Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9435327Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9435989Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9436626Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9437256Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9437885Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9438554Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9439176Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9439311Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.9439385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9439429Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9439468Z unimplemented [] 2025-12-04T09:58:53.9439530Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9439630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9440202Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9440253Z graph_break [] 
2025-12-04T09:58:53.9440325Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9440367Z Autotune Choices Stats: 2025-12-04T09:58:53.9441107Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.9441238Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9441354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9441515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9442132Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9442772Z triton_flex_attention_53 0.0106 ms 95.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9443369Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9443972Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9444590Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9445192Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9445790Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9446427Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9447066Z triton_flex_attention_58 
0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9447661Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9447792Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.9447837Z Autotune Choices Stats: 2025-12-04T09:58:53.9448598Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 
2025-12-04T09:58:53.9448835Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9449003Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9449280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9449910Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9450537Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9451196Z 
triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9451817Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9452443Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9453091Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9453709Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9454341Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9454975Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9455631Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9455759Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.9455835Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9455880Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9455960Z unimplemented [] 2025-12-04T09:58:53.9456021Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9456121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9456702Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9456741Z graph_break [] 2025-12-04T09:58:53.9456829Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:53.9456873Z Autotune Choices Stats: 2025-12-04T09:58:53.9457612Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.9457741Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9457858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9458019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9458637Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9459238Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9459874Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9460481Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9461084Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9461698Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9462296Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9462902Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9463503Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9464134Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9464265Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.9464309Z Autotune Choices Stats: 2025-12-04T09:58:53.9465072Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.9465293Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9465474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9465755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9466424Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9467050Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9467672Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9468339Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9468974Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9469602Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9470244Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9470876Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9471505Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9472131Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9472284Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.9472361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9472405Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9472448Z unimplemented [] 2025-12-04T09:58:53.9472521Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9472626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9473205Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9473245Z graph_break [] 2025-12-04T09:58:53.9473322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9473363Z Autotune Choices Stats: 2025-12-04T09:58:53.9474104Z {"num_choices": 24, "num_triton_choices": 
24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.9474246Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9474360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9474522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9475133Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9475736Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9476393Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9477016Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9477623Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9478233Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9478850Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9479453Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9480056Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9480660Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9480812Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.9480854Z Autotune Choices Stats: 2025-12-04T09:58:53.9481622Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9481842Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.9482010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9482290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9482932Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9483555Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9484178Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9484796Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9485461Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9486125Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9486758Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9487398Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9488026Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9488661Z triton_flex_attention_backward_165 0.0232 ms 68.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9488794Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.9488870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9488928Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9488967Z unimplemented [] 2025-12-04T09:58:53.9489032Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9489132Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9489731Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9489769Z graph_break [] 2025-12-04T09:58:53.9489846Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9489892Z Autotune Choices Stats: 2025-12-04T09:58:53.9490639Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.9490769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9490883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9491061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9491669Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9492279Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9492885Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9493499Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9494117Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9494725Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9495328Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9495975Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9496579Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9497188Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9497320Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.9497364Z Autotune Choices Stats: 2025-12-04T09:58:53.9498155Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.9498386Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9498553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9498831Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9499462Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9500108Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9500735Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9501365Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9502020Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9502666Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9503295Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9503920Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9504559Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9505187Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9505318Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.9505392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9505435Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9505473Z unimplemented [] 2025-12-04T09:58:53.9505537Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9505636Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9506247Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9506315Z graph_break [] 2025-12-04T09:58:53.9506393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9506434Z Autotune Choices Stats: 2025-12-04T09:58:53.9507181Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.9507310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9507424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9507586Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9508195Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9508812Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9509414Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9510022Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9510643Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9511255Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9511858Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9512461Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9513073Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9513684Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9513817Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.9513859Z Autotune Choices Stats: 2025-12-04T09:58:53.9514616Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.9514855Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9515019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9515309Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9515980Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9516608Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9517253Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9517881Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9518506Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9519141Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9519793Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9522024Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9522651Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9523296Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9523429Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.9523507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9523553Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9523592Z unimplemented [] 2025-12-04T09:58:53.9523656Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9523758Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9524328Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9524369Z graph_break [] 2025-12-04T09:58:53.9524444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9524501Z Autotune Choices Stats: 2025-12-04T09:58:53.9525266Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.9525400Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9525516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9525678Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9526330Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9526943Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9527567Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9528168Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9528773Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9529402Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9530016Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9530635Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9531233Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9531846Z triton_flex_attention_294 0.0166 ms 70.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9531977Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.9532019Z Autotune Choices Stats: 2025-12-04T09:58:53.9532785Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.9533005Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9533170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9533468Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9534113Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9534738Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9535361Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9539411Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9540039Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9540667Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9541306Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9541960Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9542586Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9543208Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9543355Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.9543430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9543474Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9543511Z unimplemented [] 2025-12-04T09:58:53.9543575Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9543675Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9544244Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9544286Z graph_break [] 2025-12-04T09:58:53.9544361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9544403Z Autotune Choices Stats: 2025-12-04T09:58:53.9545140Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.9545278Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9545404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9545565Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9546208Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9546812Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9547412Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9548031Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9548632Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9549230Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9549857Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9550468Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9551070Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9551667Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9551810Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.9551851Z Autotune Choices Stats: 2025-12-04T09:58:53.9552614Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.9552834Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9552999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9553276Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9553934Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9554582Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9555209Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9555837Z triton_flex_attention_backward_355 0.0187 ms 83.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9556506Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9557134Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9557759Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9558410Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9559046Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9559665Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9559794Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.9559883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9559926Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9559965Z unimplemented [] 2025-12-04T09:58:53.9560027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9560127Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9560698Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9560737Z graph_break [] 2025-12-04T09:58:53.9560811Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9560855Z Autotune Choices Stats: 2025-12-04T09:58:53.9561603Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.9561731Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9561846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9562017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9562645Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9563246Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9563848Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9564445Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9565062Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9565666Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9566299Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9566930Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9567550Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9568150Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9568280Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.9568333Z Autotune Choices Stats: 2025-12-04T09:58:53.9569088Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9569310Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9569475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9569752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.9570383Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9571029Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9571660Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9572283Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9572906Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9573554Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9574178Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9574798Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9575449Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9576208Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9576338Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:53.9576415Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9576457Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9576497Z unimplemented [] 2025-12-04T09:58:53.9576558Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9576658Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9577229Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9577282Z graph_break [] 2025-12-04T09:58:53.9577360Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9577401Z Autotune Choices Stats: 2025-12-04T09:58:53.9578141Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 
2025-12-04T09:58:53.9578268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9578384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9578543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9579155Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9579792Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9580392Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9581003Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9581618Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9582221Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9582823Z triton_flex_attention_428 0.0136 ms 68.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9583426Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9584060Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9584659Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9584790Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.9584832Z Autotune Choices Stats: 2025-12-04T09:58:53.9585592Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.9585826Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9586031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9586307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9586943Z triton_flex_attention_backward_455 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9587567Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9588212Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9588854Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9589488Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9590112Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9590745Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9591375Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9591996Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9592647Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9592774Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 
choices 2025-12-04T09:58:53.9592852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9592896Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9592935Z unimplemented [] 2025-12-04T09:58:53.9592996Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9593095Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9593668Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9593706Z graph_break [] 2025-12-04T09:58:53.9593780Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9593832Z Autotune Choices Stats: 2025-12-04T09:58:53.9594571Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.9594699Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:53.9594815Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9594978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9595590Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9596226Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9596870Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9597469Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9598068Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9598680Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9599287Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9599882Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9600482Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9601125Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9601254Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.9601296Z Autotune Choices Stats: 2025-12-04T09:58:53.9602057Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.9602275Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9602439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9602727Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9603360Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9603987Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9604613Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9605280Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9605905Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9606642Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9607286Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9607912Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9608537Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9609167Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9609322Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.9609397Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:53.9609441Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9609479Z unimplemented [] 2025-12-04T09:58:53.9609540Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9609654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9610227Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9610266Z graph_break [] 2025-12-04T09:58:53.9610342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9610382Z Autotune Choices Stats: 2025-12-04T09:58:53.9611125Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.9611262Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9611376Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9611538Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9612146Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9612753Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9613356Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.9613982Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9614584Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9615190Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9615799Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9616434Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9617041Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9617640Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9617805Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.9617846Z Autotune Choices Stats: 2025-12-04T09:58:53.9618626Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.9618845Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9619010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9619288Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9619928Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9620554Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9621176Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9621796Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9622453Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9623079Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9623700Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9624339Z triton_flex_attention_backward_551 0.0221 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9624965Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9625589Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9625719Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:53.9625793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9625848Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:53.9625886Z unimplemented [] 2025-12-04T09:58:53.9625993Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9626092Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9626704Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9626742Z graph_break [] 2025-12-04T09:58:53.9626817Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9626859Z Autotune Choices Stats: 2025-12-04T09:58:53.9627594Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:53.9627723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9627837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9628012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9628627Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9629235Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9629836Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9630459Z triton_flex_attention_558 0.0120 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9631078Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9631684Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9632286Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9632911Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9633508Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9634114Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9634243Z SingleProcess AUTOTUNE 
benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:53.9634285Z Autotune Choices Stats: 2025-12-04T09:58:53.9635070Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9635297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9635460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9635739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9636389Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9637028Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9637656Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9638285Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9638911Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9639571Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9640194Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9640821Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9641457Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9642082Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9642213Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:53.9642287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9642332Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9642370Z unimplemented [] 
2025-12-04T09:58:53.9642432Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9642531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9643101Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9643152Z graph_break [] 2025-12-04T09:58:53.9643236Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9643278Z Autotune Choices Stats: 2025-12-04T09:58:53.9644028Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:53.9644158Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9644273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9644432Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9645038Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9645652Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9646286Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9646884Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9647510Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9648139Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9648738Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9649342Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9649960Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9650559Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9650690Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds 
precompiling for 24 choices 2025-12-04T09:58:53.9650730Z Autotune Choices Stats: 2025-12-04T09:58:53.9651493Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:53.9651736Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9651900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9652189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9652820Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9653445Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9654082Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9654702Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9655334Z 
triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9656008Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9656649Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9657278Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9657908Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9658543Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9658673Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:53.9658749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9658796Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9658835Z unimplemented [] 2025-12-04T09:58:53.9658898Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:53.9658997Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9659582Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9659622Z graph_break [] 2025-12-04T09:58:53.9659698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9659747Z Autotune Choices Stats: 2025-12-04T09:58:53.9660504Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:53.9660644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9660760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9660922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9661536Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9662137Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9662751Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9663357Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9663969Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9664592Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9665218Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9665819Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9667764Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9668391Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9668536Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:53.9668581Z Autotune Choices Stats: 
2025-12-04T09:58:53.9669342Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:53.9669561Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9669727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9670019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9670674Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9671296Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9671919Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9672587Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9673215Z triton_flex_attention_backward_687 0.0201 ms 77.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9673843Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9674476Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9675113Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9675742Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9676412Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9676558Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:53.9676637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9676679Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9676720Z unimplemented [] 2025-12-04T09:58:53.9676782Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9676885Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9677460Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9677499Z graph_break [] 2025-12-04T09:58:53.9677576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9677618Z Autotune Choices Stats: 2025-12-04T09:58:53.9678368Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:53.9678497Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9678628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9678791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.9679416Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9680018Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9680639Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9681248Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9681850Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9682458Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9683072Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9683681Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9684283Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9684898Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9685039Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:53.9685081Z Autotune Choices Stats: 2025-12-04T09:58:53.9685840Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9686096Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9686262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9686545Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9687187Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9687823Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9688449Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9689080Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9689724Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9690349Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9690975Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9691609Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9692242Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9692877Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9693017Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:53.9693091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9693146Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9693185Z unimplemented [] 2025-12-04T09:58:53.9693249Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9693347Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9693920Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9693959Z graph_break [] 2025-12-04T09:58:53.9694034Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9694075Z Autotune Choices Stats: 2025-12-04T09:58:53.9694827Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:53.9694956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9695068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9695231Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9695857Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9696494Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9697098Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9697716Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9698333Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9698937Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9699540Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9700151Z triton_flex_attention_748 0.0150 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9700766Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9701376Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9701520Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:53.9701560Z Autotune Choices Stats: 2025-12-04T09:58:53.9702329Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:53.9702545Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9702711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9702989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9703619Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9704260Z 
triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9704889Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9705514Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9706273Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9706913Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9707548Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9708167Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9708811Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9709444Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9709574Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:53.9709668Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:53.9709718Z Traceback (most recent call last): 2025-12-04T09:58:53.9709869Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:53.9709925Z self.assertTrue( 2025-12-04T09:58:53.9710029Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:53.9710080Z raise 
self.failureException(msg)
2025-12-04T09:58:53.9710218Z AssertionError: False is not true : Log file /tmp/tmpa823a6nj/flex_attention_configs.json was not created
2025-12-04T09:58:53.9710221Z
2025-12-04T09:58:53.9710298Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:53.9710461Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:53.9710464Z
2025-12-04T09:58:53.9710556Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:53.9710633Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:53.9710677Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:53.9710715Z unimplemented []
2025-12-04T09:58:53.9710781Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:53.9711359Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:53.9711460Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:53.9711500Z graph_break []
2025-12-04T09:58:53.9711574Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:53.9712071Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:53.9712121Z current_size = base.storage().size() 2025-12-04T09:58:53.9712164Z Autotune Choices Stats: 2025-12-04T09:58:53.9712936Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:53.9713066Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9713183Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9713347Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9713959Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9714579Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9715189Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9715799Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9716437Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9717054Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9717666Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9718269Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9718880Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9719499Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9719630Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:53.9719672Z Autotune Choices Stats: 2025-12-04T09:58:53.9720421Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:53.9720643Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9720808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9721104Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9721743Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9722367Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9722987Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9723639Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9724263Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9724889Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9725528Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9726199Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9726822Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9727461Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9727604Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:53.9727682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9727727Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9727766Z unimplemented [] 2025-12-04T09:58:53.9727829Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9727936Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9728506Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9728546Z graph_break [] 2025-12-04T09:58:53.9728622Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:53.9728665Z Autotune Choices Stats: 2025-12-04T09:58:53.9729405Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:53.9729534Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9729663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9729823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9730452Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9731054Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9731664Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9732281Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9732887Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9733489Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9734100Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9734709Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9735314Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9735959Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9736105Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:53.9736149Z Autotune Choices Stats: 2025-12-04T09:58:53.9736901Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9737126Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9737296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9737573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9738228Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9738861Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9739488Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9740107Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9740757Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9741385Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9742092Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9742731Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9743363Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9743987Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9744129Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:53.9744205Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9744258Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9744300Z unimplemented [] 2025-12-04T09:58:53.9744362Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9744465Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9745039Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9745078Z graph_break [] 2025-12-04T09:58:53.9745153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9745196Z Autotune Choices Stats: 2025-12-04T09:58:53.9745953Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:53.9746084Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9746203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9746366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9746993Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9747612Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9748219Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9748833Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9749450Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:53.9750052Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9750652Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9751264Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9751877Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9752485Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9752625Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:53.9752667Z Autotune Choices Stats: 2025-12-04T09:58:53.9753433Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:53.9753650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:53.9753817Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9754097Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9754733Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9755373Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9756043Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9756674Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9757310Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9757955Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9758581Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9759215Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9759856Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9760486Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9760618Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:53.9760696Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9760740Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9760779Z unimplemented [] 2025-12-04T09:58:53.9760842Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9760942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9761527Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9761582Z graph_break [] 2025-12-04T09:58:53.9761658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9761700Z Autotune Choices Stats: 2025-12-04T09:58:53.9762441Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:53.9762570Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9762688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9762848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9763457Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9764074Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9764678Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9765280Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9765892Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9766544Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9767145Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9767751Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9768375Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9768991Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9769124Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:53.9769167Z Autotune Choices Stats: 2025-12-04T09:58:53.9769926Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9770168Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9770338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9770616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9771253Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9771877Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9772518Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9773151Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9773777Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9774416Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9775049Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9775680Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9776335Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9776975Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9777118Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:53.9777198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9777242Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9777281Z unimplemented [] 2025-12-04T09:58:53.9777341Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9777443Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9778027Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9778085Z graph_break [] 2025-12-04T09:58:53.9778159Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9778216Z Autotune Choices Stats: 2025-12-04T09:58:53.9778958Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:53.9779089Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9779207Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9779368Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9779986Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9780589Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:53.9781204Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9781806Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9782415Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9783037Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9783645Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9784250Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9784851Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9785475Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9785605Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:53.9785648Z Autotune Choices Stats: 2025-12-04T09:58:53.9786457Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:53.9786674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9786862Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9787152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9787781Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9788406Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9789024Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9789659Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9790303Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9790930Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9791568Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9792203Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9792831Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9793457Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9793588Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:53.9793675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9793718Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9793758Z unimplemented [] 2025-12-04T09:58:53.9793820Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9793937Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9794513Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9794554Z graph_break [] 2025-12-04T09:58:53.9794630Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9794673Z Autotune Choices Stats: 2025-12-04T09:58:53.9795423Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:53.9795570Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9795688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9795851Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9796492Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9797097Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9797703Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9798335Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9798941Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9799546Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9800173Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9800773Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9801378Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9801986Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9802115Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:53.9802167Z Autotune Choices Stats: 2025-12-04T09:58:53.9802927Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:53.9803145Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9803314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9803598Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9804240Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9804875Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9805497Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9806150Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9806803Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9807434Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9808060Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9808715Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9809341Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9809968Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9810098Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:53.9810176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9810221Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9810261Z unimplemented [] 2025-12-04T09:58:53.9810323Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9810424Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9811031Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9811070Z graph_break [] 2025-12-04T09:58:53.9811147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9811189Z Autotune Choices Stats: 2025-12-04T09:58:53.9811927Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:53.9812055Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9812182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9812352Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9812959Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9813565Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9814166Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9814771Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9815397Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9816038Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9816645Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9817275Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9817876Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9818480Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9818612Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:53.9818655Z Autotune Choices Stats: 2025-12-04T09:58:53.9819428Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:53.9819659Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9819828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9820105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9820736Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9821385Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9822008Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9822637Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9823267Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9823912Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9824536Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9825164Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9825818Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9826493Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9826625Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:53.9826702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9826747Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9826786Z unimplemented [] 2025-12-04T09:58:53.9826850Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9826952Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9827528Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9827568Z graph_break [] 2025-12-04T09:58:53.9827663Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9827705Z Autotune Choices Stats: 2025-12-04T09:58:53.9828456Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:53.9828585Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9828701Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9828865Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9829475Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9830103Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9830709Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9831314Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9831925Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9832536Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9833144Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9833746Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9834364Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9834969Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9835101Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:53.9835143Z Autotune Choices Stats: 2025-12-04T09:58:53.9835900Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:53.9836161Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9836345Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9836635Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:53.9837272Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9837898Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9838550Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9839175Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9839804Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9840433Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9841078Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9841708Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9842341Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9842989Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9843122Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:53.9843198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9843243Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9843281Z unimplemented [] 2025-12-04T09:58:53.9843344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9843445Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9844021Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9844060Z graph_break [] 2025-12-04T09:58:53.9844138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9844181Z Autotune Choices Stats: 2025-12-04T09:58:53.9844938Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:53.9845076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9845190Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9845354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9846003Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9846623Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9847237Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9847840Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9848451Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9849068Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9849682Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9850291Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9850891Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9851509Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9851641Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:53.9851681Z Autotune Choices Stats: 2025-12-04T09:58:53.9852446Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:53.9852665Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9852829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9853108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9853757Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9854390Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9855010Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9855651Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9856317Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9856949Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9857588Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9858251Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9858876Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9859502Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9859660Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:53.9859737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9859784Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9859823Z unimplemented [] 2025-12-04T09:58:53.9859886Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9859988Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9860556Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9860595Z graph_break [] 2025-12-04T09:58:53.9860670Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9860712Z Autotune Choices Stats: 2025-12-04T09:58:53.9861453Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:53.9861586Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:53.9861712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9861876Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9862497Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9863099Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9863709Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9864325Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9864928Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9865531Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9866189Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9866798Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9867402Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9868008Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9868164Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:53.9868206Z Autotune Choices Stats: 2025-12-04T09:58:53.9868962Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:53.9869181Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9869347Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9869631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9870273Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9870905Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9871527Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9872158Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9872807Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9873430Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9874058Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:53.9874702Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9875327Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9875987Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9876144Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:53.9876221Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:53.9876278Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9876316Z unimplemented [] 2025-12-04T09:58:53.9876380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9876479Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9877054Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9877095Z graph_break [] 2025-12-04T09:58:53.9877171Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9877216Z Autotune Choices Stats: 2025-12-04T09:58:53.9877966Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:53.9878097Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9878213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9878375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9878998Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9879619Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9880223Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:53.9880834Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9881452Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9882051Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9882657Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9883283Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9883900Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9884497Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9884642Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:53.9884684Z Autotune Choices Stats: 2025-12-04T09:58:53.9885450Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:53.9885678Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9885842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9886158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9886793Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9887438Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9888073Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9888697Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9889338Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9889985Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9890607Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9891234Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9891874Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9892507Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9892642Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:53.9892722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9892766Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:53.9892804Z unimplemented [] 2025-12-04T09:58:53.9892868Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9892970Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9893558Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9893609Z graph_break [] 2025-12-04T09:58:53.9893684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9893726Z Autotune Choices Stats: 2025-12-04T09:58:53.9894464Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:53.9894592Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9894711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9894871Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9895489Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9896133Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9896745Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9897356Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9897965Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9898581Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9899186Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9899791Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9900403Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9901023Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9901153Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:53.9901196Z Autotune Choices Stats: 2025-12-04T09:58:53.9901951Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:53.9902191Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9902358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9902636Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9903268Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9903891Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9904536Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9905167Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9905795Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9906487Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9907129Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9907756Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9908384Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9909019Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9909162Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:53.9909240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9909282Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9909324Z unimplemented [] 
2025-12-04T09:58:53.9909387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9909490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9910065Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:53.9910118Z graph_break [] 2025-12-04T09:58:53.9910196Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9910236Z Autotune Choices Stats: 2025-12-04T09:58:53.9910987Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:53.9911115Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9911231Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9911392Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9912006Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9912610Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9913232Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9913836Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9914437Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9915052Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9915667Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9916319Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9916918Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9917540Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9917685Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:53.9917727Z Autotune Choices Stats: 2025-12-04T09:58:53.9918488Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9918708Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9918887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9919177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9919819Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9920446Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9921071Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9921709Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9922358Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9922982Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9923613Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9924257Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9924881Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9925504Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9925635Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:53.9925729Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9925774Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9925813Z unimplemented [] 2025-12-04T09:58:53.9925875Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:53.9926015Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9926614Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9926654Z graph_break [] 2025-12-04T09:58:53.9926731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9926773Z Autotune Choices Stats: 2025-12-04T09:58:53.9927511Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:53.9927664Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9927781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9927948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9928558Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9929163Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9929776Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9930405Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9931010Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9931614Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9932246Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9932849Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9933450Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9934059Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9934190Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:53.9934234Z Autotune Choices Stats: 
2025-12-04T09:58:53.9935015Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:53.9935231Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9935399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9935679Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9936364Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9939103Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9941449Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9943537Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9946493Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9948576Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9950649Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9952792Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9954964Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9957135Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9957604Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:53.9957889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9958049Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9958185Z unimplemented [] 2025-12-04T09:58:53.9958410Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9958760Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9960741Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9960911Z graph_break [] 2025-12-04T09:58:53.9961184Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9961327Z Autotune Choices Stats: 2025-12-04T09:58:53.9963839Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:53.9964288Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9964819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9965412Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:53.9967518Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9969517Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9971506Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9973515Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9975606Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9976697Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9977345Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9978006Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9978606Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9979210Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9979353Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:53.9979398Z Autotune Choices Stats: 2025-12-04T09:58:53.9980185Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:53.9980428Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9980603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9980885Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9981513Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:53.9982176Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9982803Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9983428Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9984063Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9984722Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9985349Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9986029Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9986696Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9987324Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9987464Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:53.9987545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:53.9987597Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:53.9987640Z unimplemented [] 2025-12-04T09:58:53.9987712Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:53.9987817Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:53.9988398Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:53.9988440Z graph_break [] 2025-12-04T09:58:53.9988523Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:53.9988585Z Autotune Choices Stats: 2025-12-04T09:58:53.9989351Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:53.9989489Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:53.9989608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:53.9989775Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:53.9990394Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9991020Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9991626Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9992235Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9992838Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:53.9993469Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9994078Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9994683Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9995305Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9995909Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:53.9996104Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:53.9996150Z Autotune Choices Stats: 2025-12-04T09:58:53.9996909Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:53.9997136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0017575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0017884Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0018521Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0019153Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0019814Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0020447Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0021078Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0021708Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0022365Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0023003Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0023627Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0024271Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0024410Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.0024489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0024541Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0024583Z unimplemented [] 2025-12-04T09:58:54.0024652Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0024756Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0025334Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0025382Z graph_break [] 2025-12-04T09:58:54.0025457Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0025506Z Autotune Choices Stats: 2025-12-04T09:58:54.0026304Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.0026452Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0026572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0026741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0027359Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0027960Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0028598Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0029209Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0029821Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0030442Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0031067Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0031673Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0032282Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0032908Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0033047Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.0033090Z Autotune Choices Stats: 2025-12-04T09:58:54.0033843Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.0034069Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0034238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0034527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0035187Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0035812Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0036490Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0037153Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0037779Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0038411Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0039055Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0039695Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0040329Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0040954Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0041109Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.0041190Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0041236Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0041280Z unimplemented [] 2025-12-04T09:58:54.0041344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0041452Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0042029Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0042075Z graph_break [] 2025-12-04T09:58:54.0042152Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0042199Z Autotune Choices Stats: 2025-12-04T09:58:54.0042934Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.0043069Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0043200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0043362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0043988Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0044599Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0045215Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0045830Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0046470Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0047074Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0047708Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0048327Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0048932Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0049534Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0049699Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.0049749Z Autotune Choices Stats: 2025-12-04T09:58:54.0050504Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0050726Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0050895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0051179Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0051821Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0052465Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0053097Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0053726Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0054382Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0055014Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0055638Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0056331Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.0056971Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0057597Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0057728Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.0057842Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.0057893Z Traceback (most recent call last): 2025-12-04T09:58:54.0058066Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.0058108Z self.assertTrue( 2025-12-04T09:58:54.0058230Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.0058280Z raise self.failureException(msg) 2025-12-04T09:58:54.0058415Z AssertionError: False is not true : Log file /tmp/tmp79ygt4gy/flex_attention_configs.json was not created 
2025-12-04T09:58:54.0058419Z 2025-12-04T09:58:54.0058498Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.0058668Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.0058673Z 2025-12-04T09:58:54.0058768Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.0058849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0058895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0058939Z unimplemented [] 2025-12-04T09:58:54.0059003Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0059587Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.0059695Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0059734Z graph_break [] 2025-12-04T09:58:54.0059816Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0060309Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.0060376Z current_size = base.storage().size() 2025-12-04T09:58:54.0060418Z Autotune Choices Stats: 2025-12-04T09:58:54.0061180Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.0061313Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0061430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0061597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0062214Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0062850Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0063455Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0064060Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0064681Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0065292Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0065903Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0066535Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0067159Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0067766Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0067903Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.0067947Z Autotune Choices Stats: 2025-12-04T09:58:54.0068716Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.0068935Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0069121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0069418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0070053Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0070682Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0071331Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0071961Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0072597Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0073226Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0073870Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0074499Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0075121Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0075771Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0075905Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.0076038Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0076087Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0076128Z unimplemented [] 2025-12-04T09:58:54.0076197Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0076299Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0076872Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0080899Z graph_break [] 2025-12-04T09:58:54.0080991Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.0081036Z Autotune Choices Stats: 2025-12-04T09:58:54.0081826Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.0081988Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0082108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0082281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0082892Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0083502Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0084148Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0084758Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0085356Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0086040Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0086660Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0087268Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0087873Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0088500Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0088639Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.0088682Z Autotune Choices Stats: 2025-12-04T09:58:54.0089441Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0089669Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0089841Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0090127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0090778Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0091407Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0092040Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0092692Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0093321Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0093957Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0094579Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0095227Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0095860Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0096529Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0096703Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.0096782Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0096835Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0096877Z unimplemented [] 2025-12-04T09:58:54.0096948Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0097053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0097633Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0097673Z graph_break [] 2025-12-04T09:58:54.0097755Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0097802Z Autotune Choices Stats: 2025-12-04T09:58:54.0098542Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.0098679Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0098814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0098983Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0099626Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0100224Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0100843Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0101468Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0102074Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.0102682Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0103300Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0103922Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0104519Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0105125Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0105283Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.0105327Z Autotune Choices Stats: 2025-12-04T09:58:54.0106128Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.0106355Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.0106523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0106807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0107439Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0108104Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0108734Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0109364Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0110027Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0110654Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0111281Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0111930Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0112572Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0113196Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0113334Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.0113427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0113479Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0113538Z unimplemented [] 2025-12-04T09:58:54.0113608Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0113710Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0114290Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0114336Z graph_break [] 2025-12-04T09:58:54.0114411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0114456Z Autotune Choices Stats: 2025-12-04T09:58:54.0115193Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.0115323Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0115439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0115603Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0116291Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0116905Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0117507Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0118128Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0118745Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0119342Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0119950Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0120567Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0121172Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0121771Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0121905Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.0121957Z Autotune Choices Stats: 2025-12-04T09:58:54.0122716Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.0122946Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0123114Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0123393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0124032Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0124662Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0125299Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0125947Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0126582Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0127234Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0127859Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0128494Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0129134Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0129767Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0129898Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.0129976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0130020Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0130061Z unimplemented [] 2025-12-04T09:58:54.0130122Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0130224Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0130802Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0130868Z graph_break [] 2025-12-04T09:58:54.0130944Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0130985Z Autotune Choices Stats: 2025-12-04T09:58:54.0131730Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.0131861Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0131981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0132148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0132768Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0133390Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.0134010Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0134619Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0135240Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0135855Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0136532Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0137144Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0137770Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0138388Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0138523Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.0138566Z Autotune Choices Stats: 2025-12-04T09:58:54.0139330Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.0139577Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0139747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0140028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0140674Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0141307Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0141949Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0142595Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0143244Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0143907Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0144570Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0145217Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0145875Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0146575Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0146720Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.0146802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0146846Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0146887Z unimplemented [] 2025-12-04T09:58:54.0146952Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0147058Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0147654Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0147728Z graph_break [] 2025-12-04T09:58:54.0147805Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0147850Z Autotune Choices Stats: 2025-12-04T09:58:54.0148623Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.0148768Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0148889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0149056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0149696Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0150329Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0150966Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0151605Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0152233Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0152876Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0153529Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0154181Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0154827Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0155488Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0155640Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.0155687Z Autotune Choices Stats: 2025-12-04T09:58:54.0156548Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.0156783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0156982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0157291Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0157971Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0158641Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0159310Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0159998Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0160685Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0161353Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0162027Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0162708Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0163395Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0164079Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0164220Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.0164302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0164358Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0164402Z unimplemented [] 2025-12-04T09:58:54.0164468Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0164579Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0165226Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0165270Z graph_break [] 2025-12-04T09:58:54.0165352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0165395Z Autotune Choices Stats: 2025-12-04T09:58:54.0166245Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.0166408Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0166535Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0166710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0167375Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0168039Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0168693Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0169371Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0170029Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0170688Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0171354Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0172021Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0172679Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0173336Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0173479Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.0173525Z Autotune Choices Stats: 2025-12-04T09:58:54.0174372Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.0174609Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0174793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0175098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0175795Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0176528Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0177207Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0177885Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0178586Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0179279Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0179959Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0180654Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0181354Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0182037Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0182177Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.0182262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0182308Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0182350Z unimplemented [] 2025-12-04T09:58:54.0182417Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0182529Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0183172Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0183211Z graph_break [] 2025-12-04T09:58:54.0183309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0183353Z Autotune Choices Stats: 2025-12-04T09:58:54.0184161Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.0184301Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0184436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0184611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0185286Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0185981Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0186634Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0187289Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0187976Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0188628Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0189281Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0189968Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0190617Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0191269Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0191409Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.0191453Z Autotune Choices Stats: 2025-12-04T09:58:54.0192285Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.0192519Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0192709Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0193011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.0193695Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0194387Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0195072Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0195744Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0196456Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0197167Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0197838Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0198517Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0199224Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0199895Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0200034Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.0200112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0200158Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0200198Z unimplemented [] 2025-12-04T09:58:54.0200263Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0200369Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0201004Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0201044Z graph_break [] 2025-12-04T09:58:54.0201123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0201167Z Autotune Choices Stats: 2025-12-04T09:58:54.0201990Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.0202128Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0202249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0202423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0203086Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0203754Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0204405Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0205060Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0205710Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0206421Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0207065Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0207713Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0208389Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0209037Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0209177Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.0209220Z Autotune Choices Stats: 2025-12-04T09:58:54.0210038Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.0210273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0210461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0210781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0211462Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0212137Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0212829Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0213504Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0214186Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0214869Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0215577Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0216301Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0216981Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0217700Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0217846Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.0217932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0217986Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0218030Z unimplemented [] 2025-12-04T09:58:54.0218104Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0218212Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0218964Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0219013Z graph_break [] 2025-12-04T09:58:54.0219098Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0219150Z Autotune Choices Stats: 2025-12-04T09:58:54.0219980Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.0220127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.0220269Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0220449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0221117Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0221767Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0222457Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0223116Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0223775Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0224436Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0225113Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0225768Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0226450Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0227137Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0227286Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.0227333Z Autotune Choices Stats: 2025-12-04T09:58:54.0228151Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.0228397Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0228578Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0228885Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0229601Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0230277Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0230961Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0231656Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0232341Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0233025Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0233704Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0234416Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0235102Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0235785Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0235998Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.0236082Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.0236136Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0236180Z unimplemented [] 2025-12-04T09:58:54.0236254Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0236363Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0236992Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0237040Z graph_break [] 2025-12-04T09:58:54.0237122Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0237172Z Autotune Choices Stats: 2025-12-04T09:58:54.0237970Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.0238115Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0238241Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0238456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0239137Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0239794Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0240451Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.0241143Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0241804Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0242458Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0243129Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0243796Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0244456Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0245116Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0245285Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.0245332Z Autotune Choices Stats: 2025-12-04T09:58:54.0246201Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.0246444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0246625Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0246932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0247619Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0248325Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0249018Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0249698Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0250404Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0251081Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0251765Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0252469Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0253168Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0253851Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0253996Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.0254097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0254146Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.0254195Z unimplemented [] 2025-12-04T09:58:54.0254275Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0254390Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0255011Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0255058Z graph_break [] 2025-12-04T09:58:54.0255142Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0255195Z Autotune Choices Stats: 2025-12-04T09:58:54.0256050Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.0256195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0256327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0256504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0257193Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0257864Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0258526Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0259199Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0259872Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0260530Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0261184Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0261857Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0262520Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0263173Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0263320Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.0263384Z Autotune Choices Stats: 2025-12-04T09:58:54.0264211Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.0264462Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0264650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0264953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0265654Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0266407Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0267098Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0267793Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0268483Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0269201Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0269878Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0270569Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0271271Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0271966Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0272110Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.0272198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0272246Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0272295Z unimplemented [] 
2025-12-04T09:58:54.0272364Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0272479Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0273103Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0273173Z graph_break [] 2025-12-04T09:58:54.0273256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0273308Z Autotune Choices Stats: 2025-12-04T09:58:54.0274133Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.0274275Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0274403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0274579Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0275249Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0275965Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0276635Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0277297Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0277984Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0278659Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0279318Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0279981Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0280657Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0281328Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0281471Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.0281521Z Autotune Choices Stats: 2025-12-04T09:58:54.0282347Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0282621Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0282809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0283115Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0283812Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0284503Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0285189Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0285884Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0286607Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0287301Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0288014Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0288706Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0289404Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0290101Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0290242Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.0290346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0290395Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0290445Z unimplemented [] 2025-12-04T09:58:54.0290515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.0290630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0291265Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0291313Z graph_break [] 2025-12-04T09:58:54.0291413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0291460Z Autotune Choices Stats: 2025-12-04T09:58:54.0292269Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.0292423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0292556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0292733Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0293416Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0294077Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0294743Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0295408Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0296107Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0296779Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0297451Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0298114Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0298775Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0299450Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0299590Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.0299659Z Autotune Choices Stats: 
2025-12-04T09:58:54.0300493Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.0300732Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0300930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0301245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0301938Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0302627Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0303311Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0303999Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0304712Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0305401Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0306125Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0306828Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0307516Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0308203Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0308351Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.0308435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0308488Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0308544Z unimplemented [] 2025-12-04T09:58:54.0308617Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0308727Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0309370Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0309416Z graph_break [] 2025-12-04T09:58:54.0309504Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0309552Z Autotune Choices Stats: 2025-12-04T09:58:54.0310360Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.0310527Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0310653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0310836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.0311497Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0312159Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0312827Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0313513Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0314190Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0314853Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0315528Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0316258Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0316913Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0317570Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0317719Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.0317766Z Autotune Choices Stats: 2025-12-04T09:58:54.0318627Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0318865Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0319052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0319361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0320064Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0320758Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0321440Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0322121Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0322830Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0323522Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0324203Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0324904Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0325592Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0326334Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0326479Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.0326564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0326619Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0326663Z unimplemented [] 2025-12-04T09:58:54.0326737Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0326849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0327493Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0327534Z graph_break [] 2025-12-04T09:58:54.0327635Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0327684Z Autotune Choices Stats: 2025-12-04T09:58:54.0328483Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.0328629Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0328756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0328950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0329629Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0330290Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0330959Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0331626Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0332303Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0332959Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0333625Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0334296Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0334964Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0335627Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0335773Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.0335823Z Autotune Choices Stats: 2025-12-04T09:58:54.0336708Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0336949Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0337143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0337450Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0338142Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0338835Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0339535Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0340228Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0340912Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0341608Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0342305Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0342990Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0343682Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0344387Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0344537Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.0344620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0344673Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0344716Z unimplemented [] 2025-12-04T09:58:54.0344790Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0344901Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0345529Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0345579Z graph_break [] 2025-12-04T09:58:54.0345661Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0345713Z Autotune Choices Stats: 2025-12-04T09:58:54.0346581Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.0346727Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0346852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0347034Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0347707Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0348391Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0349060Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0349723Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0350384Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0351068Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0351728Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0352390Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0353070Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0353729Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0353879Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.0353926Z Autotune Choices Stats: 2025-12-04T09:58:54.0354756Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.0355001Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0355197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0355503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0356251Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0356930Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0357636Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0358334Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0359024Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0359706Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0360417Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0361175Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0361859Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0362567Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0362715Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.0362799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0362852Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0362896Z unimplemented [] 2025-12-04T09:58:54.0362970Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0363082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0363713Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0363763Z graph_break [] 2025-12-04T09:58:54.0363846Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0363898Z Autotune Choices Stats: 2025-12-04T09:58:54.0364723Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.0364870Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0365012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0365190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0365860Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0366557Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0367239Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0367902Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0368564Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0369225Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0369915Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0370579Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0371243Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0371920Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0372068Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.0372120Z Autotune Choices Stats: 2025-12-04T09:58:54.0372949Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0373196Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0373379Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0373688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0374407Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0375087Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0375777Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0376509Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0377198Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0377890Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0378570Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0379285Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.0379977Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0380660Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0380828Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.0380916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0380965Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0381014Z unimplemented [] 2025-12-04T09:58:54.0381083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0381199Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0381833Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0381881Z graph_break [] 2025-12-04T09:58:54.0381964Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0382015Z Autotune Choices Stats: 2025-12-04T09:58:54.0382817Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.0382965Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0383096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0383283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0383965Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0384618Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0385273Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0385984Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.0386655Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0387319Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0387966Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0388662Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0389323Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0389979Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0390145Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.0390197Z Autotune Choices Stats: 2025-12-04T09:58:54.0391034Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0391276Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0391459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0391759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0392448Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0393161Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0393832Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0394511Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0395223Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0395910Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0396632Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0397319Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0398033Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0398716Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0398856Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.0398967Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.0399036Z Traceback (most recent call last): 2025-12-04T09:58:54.0399213Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.0399273Z self.assertTrue( 2025-12-04T09:58:54.0399396Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.0399451Z raise self.failureException(msg) 2025-12-04T09:58:54.0399598Z AssertionError: False is not true : Log file /tmp/tmpmiti4lfu/flex_attention_configs.json was not created 2025-12-04T09:58:54.0399601Z 2025-12-04T09:58:54.0399688Z To execute this test, 
run the following from the base repo dir: 2025-12-04T09:58:54.0399876Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.0399879Z 2025-12-04T09:58:54.0399980Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.0400071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0400120Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0400168Z unimplemented [] 2025-12-04T09:58:54.0400237Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0400872Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.0400987Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0401030Z graph_break [] 2025-12-04T09:58:54.0401120Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0401663Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.0401725Z current_size = base.storage().size() 2025-12-04T09:58:54.0401773Z Autotune Choices Stats: 2025-12-04T09:58:54.0402599Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.0402748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0402874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0403060Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0403730Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0404412Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0405076Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0405748Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0406456Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0407144Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0407800Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0408461Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0409152Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0409817Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0409965Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.0410010Z Autotune Choices Stats: 2025-12-04T09:58:54.0410843Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.0411084Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0411280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0411588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0412283Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0412964Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0413660Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0414352Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0415041Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0415725Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0416448Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0417137Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0417826Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0418536Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0418682Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.0418767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0418818Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0418860Z unimplemented [] 2025-12-04T09:58:54.0418932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0419043Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0419674Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0419717Z graph_break [] 2025-12-04T09:58:54.0419805Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.0419851Z Autotune Choices Stats: 2025-12-04T09:58:54.0420668Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.0420812Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0420949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0421132Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0421801Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0422463Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0423149Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0423809Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0424469Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0425128Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0425810Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0426514Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0427167Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0427869Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0428015Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.0428064Z Autotune Choices Stats: 2025-12-04T09:58:54.0428894Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0429135Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0429315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0429628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0430347Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0431031Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0431715Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0432423Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0433105Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0433791Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0434477Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0435180Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0435857Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0436604Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0436778Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.0436861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0436914Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0436958Z unimplemented [] 2025-12-04T09:58:54.0437032Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0437143Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0437767Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0437815Z graph_break [] 2025-12-04T09:58:54.0437897Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0437948Z Autotune Choices Stats: 2025-12-04T09:58:54.0438760Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.0438906Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0439032Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0439226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0439909Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0440564Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0441226Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0441905Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0442567Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.0443226Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0443888Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0444577Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0445237Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0445897Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0446104Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.0446151Z Autotune Choices Stats: 2025-12-04T09:58:54.0446994Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.0447238Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.0447422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0447739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0448430Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0449146Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0449833Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0450519Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0451234Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0451917Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0452603Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0453296Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0454006Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0454690Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0454835Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.0454922Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0454982Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0455031Z unimplemented [] 2025-12-04T09:58:54.0455100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0455228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0455854Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0455903Z graph_break [] 2025-12-04T09:58:54.0456020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0456073Z Autotune Choices Stats: 2025-12-04T09:58:54.0456883Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.0457030Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0457164Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0457342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0458029Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0458708Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0459360Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0460018Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0460698Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0461357Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0462016Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0462700Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0463376Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0464032Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0464176Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.0464227Z Autotune Choices Stats: 2025-12-04T09:58:54.0465063Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.0465316Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0465501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0465809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0466539Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0467222Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0467931Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0468613Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0469302Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0470038Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0470718Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0471411Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0472108Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0472803Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0472944Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.0473031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0473080Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0473129Z unimplemented [] 2025-12-04T09:58:54.0473199Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0473316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0473944Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0474014Z graph_break [] 2025-12-04T09:58:54.0474095Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0474146Z Autotune Choices Stats: 2025-12-04T09:58:54.0474964Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.0475105Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0475236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0475415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0476120Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0476807Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.0477473Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0478126Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0478799Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0479481Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0480142Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0480803Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0481477Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0482148Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0482291Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.0482341Z Autotune Choices Stats: 2025-12-04T09:58:54.0483156Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.0483420Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0483610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0483917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0484613Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0485298Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0486064Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0486756Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0487446Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0488139Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0488858Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0489540Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0490235Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0490934Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0491075Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.0491178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0491229Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0491279Z unimplemented [] 2025-12-04T09:58:54.0491349Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0491466Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0492093Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0492144Z graph_break [] 2025-12-04T09:58:54.0492230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0492289Z Autotune Choices Stats: 2025-12-04T09:58:54.0493099Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.0493251Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0493383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0493564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0494231Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0494893Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0495565Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0496283Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0496952Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0497631Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0498302Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0498960Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0499622Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0500290Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0500431Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.0500480Z Autotune Choices Stats: 2025-12-04T09:58:54.0501322Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.0501560Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0501746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0502066Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0502760Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0503448Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0504134Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0504821Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0505519Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0506246Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0506925Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0507638Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0508332Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0509019Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0509162Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.0509248Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0509297Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0509342Z unimplemented [] 2025-12-04T09:58:54.0509426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0509542Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0510191Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0510236Z graph_break [] 2025-12-04T09:58:54.0510324Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0510370Z Autotune Choices Stats: 2025-12-04T09:58:54.0511183Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.0511346Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0511477Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0511662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0512331Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0512994Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0513654Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0514324Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0515007Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0515673Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0516380Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0517052Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0517710Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0518376Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0518527Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.0518573Z Autotune Choices Stats: 2025-12-04T09:58:54.0519431Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.0519668Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0519855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0520172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0520865Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0521573Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0522254Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0522940Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0523636Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0524333Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0525023Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0525718Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0526445Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0527131Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0527280Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.0527361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0527416Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0527459Z unimplemented [] 2025-12-04T09:58:54.0527533Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0527643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0528293Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0528336Z graph_break [] 2025-12-04T09:58:54.0528425Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0528485Z Autotune Choices Stats: 2025-12-04T09:58:54.0529299Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.0529444Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0529570Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0529767Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0530432Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0531110Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0531771Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0532431Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0533103Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0533780Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0534433Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0535105Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0535772Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0536490Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0536638Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.0536685Z Autotune Choices Stats: 2025-12-04T09:58:54.0537536Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.0537774Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0537981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0538290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.0538973Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0539668Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0540371Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0541054Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0541736Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0542434Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0543129Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0543818Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0544505Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0545201Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0545350Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.0545433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0545486Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0545530Z unimplemented [] 2025-12-04T09:58:54.0545604Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0545714Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0546385Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0546430Z graph_break [] 2025-12-04T09:58:54.0546518Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0546567Z Autotune Choices Stats: 2025-12-04T09:58:54.0547406Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.0547552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0547676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0547860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0548532Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0549219Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0549882Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0550545Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0551202Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0551882Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0552539Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0553199Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0553870Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0554537Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0554686Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.0554733Z Autotune Choices Stats: 2025-12-04T09:58:54.0555562Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.0555804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0556080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0556386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0557091Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0557780Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0558478Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0559172Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0559857Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0560545Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0561236Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0561929Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0562618Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0563313Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0563469Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.0563553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0563611Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0563655Z unimplemented [] 2025-12-04T09:58:54.0563729Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0563840Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0564475Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0564524Z graph_break [] 2025-12-04T09:58:54.0564607Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0564661Z Autotune Choices Stats: 2025-12-04T09:58:54.0565479Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.0565626Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.0565764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0566005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0566677Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0567338Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0568033Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0568694Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0569351Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0570010Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0570692Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0571344Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0572016Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0572693Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0572842Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.0572888Z Autotune Choices Stats: 2025-12-04T09:58:54.0573716Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.0573960Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0574143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0574456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0575158Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0575851Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0576603Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0577302Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0578003Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0578690Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0579370Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0580078Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0580766Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0581446Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0581615Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.0581703Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.0581754Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0581804Z unimplemented [] 2025-12-04T09:58:54.0581874Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0581989Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0582616Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0582666Z graph_break [] 2025-12-04T09:58:54.0582749Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0582800Z Autotune Choices Stats: 2025-12-04T09:58:54.0583606Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.0583755Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0583886Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0584082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0584767Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0585430Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0586116Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.0586806Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0587471Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0588127Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0588784Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0589476Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0590136Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0590790Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0590956Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.0591008Z Autotune Choices Stats: 2025-12-04T09:58:54.0591838Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.0592081Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0592268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0592571Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0593260Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0593971Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0594651Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0595338Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0596095Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0596783Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0597462Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0598150Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0598868Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0599553Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0599693Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.0599783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0599846Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.0599894Z unimplemented [] 2025-12-04T09:58:54.0599963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0600075Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0600720Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0600768Z graph_break [] 2025-12-04T09:58:54.0600851Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0600904Z Autotune Choices Stats: 2025-12-04T09:58:54.0601722Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.0601862Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0601991Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0602170Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0602862Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0603530Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0604191Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0604853Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0605531Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0606234Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0606894Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0607555Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0608240Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0608893Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0609034Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.0609087Z Autotune Choices Stats: 2025-12-04T09:58:54.0609928Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.0610182Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0610370Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0610669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0611362Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0612051Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0612750Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0613451Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0614140Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0614860Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0615538Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0616257Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0616944Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0617658Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0617802Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.0617892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0617941Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0617990Z unimplemented [] 
2025-12-04T09:58:54.0618058Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0618176Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0618800Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0618882Z graph_break [] 2025-12-04T09:58:54.0618970Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0619017Z Autotune Choices Stats: 2025-12-04T09:58:54.0619818Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.0619960Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0620095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0620272Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0620939Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0621620Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0622285Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0622944Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0623601Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0624290Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0624947Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0625602Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0626323Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0626999Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0627143Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.0627196Z Autotune Choices Stats: 2025-12-04T09:58:54.0628026Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0628277Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0628475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0628778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0631755Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0632433Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0633109Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0633816Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0634498Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0635178Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0635882Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0636602Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0637276Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0637981Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0638120Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.0638206Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0638268Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0638311Z unimplemented [] 2025-12-04T09:58:54.0638379Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.0638490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0639108Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0639147Z graph_break [] 2025-12-04T09:58:54.0639229Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0639272Z Autotune Choices Stats: 2025-12-04T09:58:54.0640100Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.0640251Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0640375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0640552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0641217Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0641876Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0642544Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0643197Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0643858Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0644519Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0645179Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0645827Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0646494Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0647154Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0647294Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.0647338Z Autotune Choices Stats: 
2025-12-04T09:58:54.0648182Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.0648420Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0648601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0648922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0649628Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0650307Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0650983Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0651668Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0652358Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0653040Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0653717Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0654423Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0655103Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0655779Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0655960Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.0656042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0656090Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0656130Z unimplemented [] 2025-12-04T09:58:54.0656198Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0656324Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0656976Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0657017Z graph_break [] 2025-12-04T09:58:54.0657102Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0657145Z Autotune Choices Stats: 2025-12-04T09:58:54.0657946Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.0658100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0658237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0658414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.0659074Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0659728Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0660376Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0661044Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0661707Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0662358Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0663023Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0663690Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0664337Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0664987Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0665130Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.0665172Z Autotune Choices Stats: 2025-12-04T09:58:54.0666072Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0666308Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0666491Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0666794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0667490Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0668186Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0668861Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0669534Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0670224Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0670911Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0671587Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0672276Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0672964Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0673641Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0673781Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.0673862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0673911Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0673951Z unimplemented [] 2025-12-04T09:58:54.0674019Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0674126Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0674766Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0674806Z graph_break [] 2025-12-04T09:58:54.0674888Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0674931Z Autotune Choices Stats: 2025-12-04T09:58:54.0675748Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.0675888Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0676047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0676239Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0676900Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0677565Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0678227Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0678876Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0679536Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0680204Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0680858Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0681520Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0682180Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0682837Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0682977Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.0683019Z Autotune Choices Stats: 2025-12-04T09:58:54.0683841Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0684088Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0684275Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0684579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0685263Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0685995Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0686689Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0687367Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0688054Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0688747Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0689434Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0690117Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0690807Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0691502Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0691642Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.0691722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0691769Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0691808Z unimplemented [] 2025-12-04T09:58:54.0691875Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0691982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0692606Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0692648Z graph_break [] 2025-12-04T09:58:54.0692726Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0692770Z Autotune Choices Stats: 2025-12-04T09:58:54.0693601Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.0693740Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0693862Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0694037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0694697Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0695370Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0696053Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0696699Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0697347Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0698027Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0698689Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0699337Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0700002Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0700657Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0700796Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.0700838Z Autotune Choices Stats: 2025-12-04T09:58:54.0701654Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.0701890Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0702072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0702377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0703068Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0703741Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0704428Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0705115Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0705793Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0706524Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0707210Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0707895Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0708574Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0709263Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0709416Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.0709496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0709543Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0709585Z unimplemented [] 2025-12-04T09:58:54.0709649Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0709756Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0710379Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0710420Z graph_break [] 2025-12-04T09:58:54.0710499Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0710543Z Autotune Choices Stats: 2025-12-04T09:58:54.0711342Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.0711490Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0711612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0711795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0712453Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0713108Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0713780Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0714430Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0715087Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0715736Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0716585Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0717230Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0717876Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0718540Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0718692Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.0718736Z Autotune Choices Stats: 2025-12-04T09:58:54.0719554Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0719791Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0719969Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0720268Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0720962Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0721650Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0722324Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0723001Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0723702Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0724385Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0725057Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0725768Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.0726466Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0727141Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0727306Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.0727388Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0727432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0727472Z unimplemented [] 2025-12-04T09:58:54.0727539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0727649Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0728271Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0728315Z graph_break [] 2025-12-04T09:58:54.0728394Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0728439Z Autotune Choices Stats: 2025-12-04T09:58:54.0729233Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.0729371Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0729494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0729669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0730361Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0731013Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0731662Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0732334Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.0732985Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0733636Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0734290Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0734963Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0735608Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0736302Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0736465Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.0736509Z Autotune Choices Stats: 2025-12-04T09:58:54.0737331Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.0737566Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0737746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0738046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0738737Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0739439Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0740111Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0740786Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0741490Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0742169Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0742841Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0743529Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0744240Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0744913Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0745050Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.0745130Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0745174Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0745228Z unimplemented [] 2025-12-04T09:58:54.0745293Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0745400Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0746074Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0746115Z graph_break [] 2025-12-04T09:58:54.0746194Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0746238Z Autotune Choices Stats: 2025-12-04T09:58:54.0747036Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.0747174Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0747297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0747472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0748151Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0748814Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0749467Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0750115Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0750795Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0751447Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0752106Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0752755Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0753426Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0754073Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0754212Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.0754255Z Autotune Choices Stats: 2025-12-04T09:58:54.0755077Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.0755334Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0755512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0755814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0756536Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0757205Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0757913Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0758586Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0759263Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0759974Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0760647Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0761326Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0762001Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0762691Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0762830Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.0762931Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.0762983Z Traceback (most recent call last): 2025-12-04T09:58:54.0763149Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.0763194Z self.assertTrue( 2025-12-04T09:58:54.0763306Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.0763360Z raise self.failureException(msg) 2025-12-04T09:58:54.0763497Z AssertionError: False is not true : Log file /tmp/tmp8zeqdgl2/flex_attention_configs.json was not created 2025-12-04T09:58:54.0763512Z 2025-12-04T09:58:54.0763594Z To execute this test, run the following from the base repo dir: 
2025-12-04T09:58:54.0763771Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.0763785Z 2025-12-04T09:58:54.0763882Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.0763963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0764009Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0764049Z unimplemented [] 2025-12-04T09:58:54.0764117Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0764748Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.0764857Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0764895Z graph_break [] 2025-12-04T09:58:54.0764976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0765508Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.0765563Z current_size = base.storage().size() 2025-12-04T09:58:54.0765609Z Autotune Choices Stats: 2025-12-04T09:58:54.0766498Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.0766641Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0766766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0766957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0767617Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0768270Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0768952Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0769603Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0770254Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0770901Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0771576Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0772225Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0772876Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0773547Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0773692Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.0773735Z Autotune Choices Stats: 2025-12-04T09:58:54.0774546Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.0774783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0774960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0775264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0775993Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0776676Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0777349Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0778048Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0778734Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0779406Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0780079Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0780772Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0781446Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0782119Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0782280Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.0782362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0782413Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0782454Z unimplemented [] 2025-12-04T09:58:54.0782523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0782631Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0783252Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0783299Z graph_break [] 2025-12-04T09:58:54.0783379Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.0783426Z Autotune Choices Stats: 2025-12-04T09:58:54.0784226Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.0784367Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0784494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0784669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0785345Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0786025Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0786678Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0787351Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0788005Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0788657Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0789307Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0789980Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0790635Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0791283Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0791447Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.0791493Z Autotune Choices Stats: 2025-12-04T09:58:54.0792310Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0792555Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0792734Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0793037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0793718Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0794425Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0795099Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0795771Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0796507Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0797186Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0797861Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0798539Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0799252Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0799926Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0800067Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.0800149Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0800196Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0800255Z unimplemented [] 2025-12-04T09:58:54.0800323Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0800433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0801069Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0801112Z graph_break [] 2025-12-04T09:58:54.0801192Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0801240Z Autotune Choices Stats: 2025-12-04T09:58:54.0802039Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.0802182Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0802306Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0802482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0803150Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0803823Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0804475Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0805126Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0805805Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.0806483Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0807143Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0807802Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0808483Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0809136Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0809276Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.0809323Z Autotune Choices Stats: 2025-12-04T09:58:54.0810156Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.0810429Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.0810608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0810914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0811602Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0812281Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0812987Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0813663Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0814345Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0815041Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0815715Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0816440Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0817122Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0817828Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0817968Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.0818052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0818097Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0818140Z unimplemented [] 2025-12-04T09:58:54.0818205Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0818315Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0818942Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0819015Z graph_break [] 2025-12-04T09:58:54.0819097Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0819144Z Autotune Choices Stats: 2025-12-04T09:58:54.0819947Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.0820085Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0820215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0820390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0821061Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0821723Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0822384Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0823038Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0823692Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0824364Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0825010Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0825664Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0826357Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0827036Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0827176Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.0827221Z Autotune Choices Stats: 2025-12-04T09:58:54.0828046Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.0828294Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0828493Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0828795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0829477Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0830155Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0830829Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0831517Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0832200Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0832878Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0833572Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0834252Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0834935Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0835624Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0835762Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.0835844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0835890Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0835981Z unimplemented [] 2025-12-04T09:58:54.0836066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0836175Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0836801Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0836841Z graph_break [] 2025-12-04T09:58:54.0836923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0838932Z Autotune Choices Stats: 2025-12-04T09:58:54.0839744Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.0839919Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0840045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0840240Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0840910Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0841567Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.0842221Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0842893Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0843544Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0844261Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0844941Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0845593Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0846297Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0846943Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0847085Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.0847129Z Autotune Choices Stats: 2025-12-04T09:58:54.0847966Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.0848202Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0848413Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0848725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0849432Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0850107Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0850785Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0851466Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0852159Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0852828Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0853516Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0854218Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0854896Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0855569Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0855711Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.0855793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0855842Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0855882Z unimplemented [] 2025-12-04T09:58:54.0856000Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0856113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0856758Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0856804Z graph_break [] 2025-12-04T09:58:54.0856883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0856927Z Autotune Choices Stats: 2025-12-04T09:58:54.0857743Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.0857898Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0858041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0858219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0858882Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0859540Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0860202Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0860864Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0861534Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0862195Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0862863Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0863527Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0864183Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0864841Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0864983Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.0865027Z Autotune Choices Stats: 2025-12-04T09:58:54.0865864Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.0866136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0866314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0866614Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0867316Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0868020Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0868697Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0869379Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0870058Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0870749Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0871430Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0872119Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0872819Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0873494Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0873636Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.0873719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0873766Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0873807Z unimplemented [] 2025-12-04T09:58:54.0873875Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0873983Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0874603Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0874646Z graph_break [] 2025-12-04T09:58:54.0874726Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0874771Z Autotune Choices Stats: 2025-12-04T09:58:54.0875586Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.0875731Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0875856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0876122Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0876796Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0877468Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0878124Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0878779Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0879427Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0880104Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0880768Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0881436Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0882098Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0882753Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0882894Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.0882939Z Autotune Choices Stats: 2025-12-04T09:58:54.0883760Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.0883996Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0884190Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0884492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0885175Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0885864Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0886603Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0887276Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0887952Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0888632Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0889326Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0890025Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0890717Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0891409Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0891549Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.0891634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0891680Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0891723Z unimplemented [] 2025-12-04T09:58:54.0891789Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0891899Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0892519Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0892563Z graph_break [] 2025-12-04T09:58:54.0892643Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0892688Z Autotune Choices Stats: 2025-12-04T09:58:54.0893502Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.0893639Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0893765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0893938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0894613Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0895278Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0895969Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0896623Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0897276Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0897929Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0898594Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0899260Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0899929Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0900594Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0900732Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.0900780Z Autotune Choices Stats: 2025-12-04T09:58:54.0901597Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.0901831Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0902010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0902305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.0902997Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0903685Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0904370Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0905062Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0905738Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0906445Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0907121Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0907815Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0908505Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0909197Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0909353Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.0909435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0909485Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0909527Z unimplemented [] 2025-12-04T09:58:54.0909597Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0909706Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0910334Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0910378Z graph_break [] 2025-12-04T09:58:54.0910458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0910505Z Autotune Choices Stats: 2025-12-04T09:58:54.0911312Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.0911454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0911575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0911766Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0912429Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0913096Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0913762Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0914427Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0915085Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0915735Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0916463Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0917137Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0917819Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0918484Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0918643Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.0918687Z Autotune Choices Stats: 2025-12-04T09:58:54.0919511Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.0919753Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0919933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0920237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0920918Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0921610Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0922311Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0922995Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0923694Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0924368Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0925040Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0925707Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0926448Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0927118Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0927272Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.0927365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0927413Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0927458Z unimplemented [] 2025-12-04T09:58:54.0927528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0927634Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0928255Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0928299Z graph_break [] 2025-12-04T09:58:54.0928378Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0928424Z Autotune Choices Stats: 2025-12-04T09:58:54.0929212Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.0929351Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.0929479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0929650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0930320Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0930960Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0931625Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0932291Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0932931Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0933571Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0934218Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0934867Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0935479Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0936144Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0936289Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.0936344Z Autotune Choices Stats: 2025-12-04T09:58:54.0937117Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.0937340Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0937508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0937792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0938438Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0939073Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0939730Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0940375Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0941024Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0941669Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0942300Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.0942939Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0943584Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0944214Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0944348Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.0944429Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.0944483Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0944525Z unimplemented [] 2025-12-04T09:58:54.0944597Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0944702Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0945278Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0945330Z graph_break [] 2025-12-04T09:58:54.0945404Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0945449Z Autotune Choices Stats: 2025-12-04T09:58:54.0946235Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.0946368Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0946487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0946651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0947266Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0947893Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0948497Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.0949113Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0949743Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0950346Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0950956Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0951564Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0952181Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0952787Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0952917Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.0952960Z Autotune Choices Stats: 2025-12-04T09:58:54.0953735Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.0953974Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0954143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0954423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0955065Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0955695Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0956376Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0957009Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0957653Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0958304Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0958932Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0959563Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0960194Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0960832Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0960962Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.0961040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0961084Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.0961125Z unimplemented [] 2025-12-04T09:58:54.0961187Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0961290Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0961875Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0961927Z graph_break [] 2025-12-04T09:58:54.0962013Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0962056Z Autotune Choices Stats: 2025-12-04T09:58:54.0962802Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.0962930Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0963051Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0963213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0963825Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0964428Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0965047Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0965650Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0966308Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0966945Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0967552Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0968158Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0968761Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0969392Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0969523Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.0969570Z Autotune Choices Stats: 2025-12-04T09:58:54.0970337Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.0970569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0970749Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0971029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0971670Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0972299Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0972921Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0973566Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.0974197Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0974837Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0975487Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0976157Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0976786Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0977411Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0977542Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.0977621Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0977664Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0977706Z unimplemented [] 
2025-12-04T09:58:54.0977781Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.0977884Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0978461Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.0978501Z graph_break [] 2025-12-04T09:58:54.0978578Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0978619Z Autotune Choices Stats: 2025-12-04T09:58:54.0979384Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.0979538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0979656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0979824Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0980436Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0981046Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0981652Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0982265Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0982867Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0983493Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0984123Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0984726Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0985343Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0985982Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0986113Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.0986161Z Autotune Choices Stats: 2025-12-04T09:58:54.0986940Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.0987159Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0987348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0987638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0988285Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0988918Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0989546Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0990170Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0990811Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0991453Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0992090Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0992737Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0993370Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0993998Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0994130Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.0994206Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.0994253Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.0994291Z unimplemented [] 2025-12-04T09:58:54.0994358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.0994458Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.0995057Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.0995096Z graph_break [] 2025-12-04T09:58:54.0995174Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.0995214Z Autotune Choices Stats: 2025-12-04T09:58:54.0996003Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.0996143Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.0996273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.0996437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.0997053Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0997660Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.0998266Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0998872Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.0999495Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1000100Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1000718Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1001350Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1001956Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1002561Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1002694Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.1002737Z Autotune Choices Stats: 
2025-12-04T09:58:54.1003510Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.1003726Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1003896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1004177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1004823Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1005472Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1006126Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1006747Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1007376Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1008019Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1008642Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1009286Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1009941Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1010566Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1010698Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.1010773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1010819Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1010860Z unimplemented [] 2025-12-04T09:58:54.1010928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1011027Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1011611Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1011651Z graph_break [] 2025-12-04T09:58:54.1011730Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1011771Z Autotune Choices Stats: 2025-12-04T09:58:54.1012526Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.1012659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1012772Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1012952Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.1013575Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1014197Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1014805Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1015416Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1016037Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1016653Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1017264Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1017885Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1018496Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1019109Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1019245Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.1019286Z Autotune Choices Stats: 2025-12-04T09:58:54.1020053Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1020274Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1020450Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1020728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1021362Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1022008Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1022656Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1023278Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1023910Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1024539Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1025175Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1025818Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1026471Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1027130Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1027943Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.1028192Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1028360Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1028481Z unimplemented [] 2025-12-04T09:58:54.1028608Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1028818Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1029542Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1030185Z graph_break [] 2025-12-04T09:58:54.1030322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1030481Z Autotune Choices Stats: 2025-12-04T09:58:54.1031305Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.1032218Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1032503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1032822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1033656Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1034933Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1036248Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1037549Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1038799Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1040026Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1041303Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1042559Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1043813Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1045065Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1045833Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.1046080Z Autotune Choices Stats: 2025-12-04T09:58:54.1046906Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1047917Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1048335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1048819Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1049775Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1051071Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1052342Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1053657Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1054943Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1056252Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1057540Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1058845Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1060153Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1061451Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1062255Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.1062497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1062654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1062766Z unimplemented [] 2025-12-04T09:58:54.1062888Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1063085Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1063798Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1064439Z graph_break [] 2025-12-04T09:58:54.1064570Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1064722Z Autotune Choices Stats: 2025-12-04T09:58:54.1065526Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.1066455Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1066731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1067061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1067870Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1069141Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1070384Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1071636Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1072880Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1074120Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1075367Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1076672Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1077925Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1079186Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1079971Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.1080176Z Autotune Choices Stats: 2025-12-04T09:58:54.1080997Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.1082006Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1082427Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1082909Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1083858Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1085149Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1086508Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1087802Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1089105Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1090397Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1091700Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1092977Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1094272Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1095579Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1096418Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.1096673Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1096830Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1096942Z unimplemented [] 2025-12-04T09:58:54.1097061Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1097257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1097970Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1098618Z graph_break [] 2025-12-04T09:58:54.1098749Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1098899Z Autotune Choices Stats: 2025-12-04T09:58:54.1099703Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.1100596Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1100873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1101186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1102019Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1103269Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1104518Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1105783Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1107040Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1108285Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1109532Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1110776Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1112034Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1113291Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1114067Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.1114285Z Autotune Choices Stats: 2025-12-04T09:58:54.1115103Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1116141Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1116555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1117032Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1117975Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1119257Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1120549Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1121834Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1123132Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1124435Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1125705Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1127017Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.1128305Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1129609Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1130401Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.1130640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1130796Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1130919Z unimplemented [] 2025-12-04T09:58:54.1131040Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1131245Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1131946Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1132602Z graph_break [] 2025-12-04T09:58:54.1135539Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1135708Z Autotune Choices Stats: 2025-12-04T09:58:54.1136557Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.1137465Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1137755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1138077Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1138883Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1140161Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1141388Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1142637Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.1143911Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1145163Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1146448Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1147690Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1148954Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1150193Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1150966Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.1151180Z Autotune Choices Stats: 2025-12-04T09:58:54.1152031Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1153069Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1153484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1153964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1154924Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1156249Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1157536Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1158833Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1160141Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1161453Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1162759Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1164056Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1165352Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1166693Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1167493Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.1167743Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1167907Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1168023Z unimplemented [] 2025-12-04T09:58:54.1168143Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1168350Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1169087Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1169758Z graph_break [] 2025-12-04T09:58:54.1169897Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1170074Z Autotune Choices Stats: 2025-12-04T09:58:54.1170881Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.1171789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1172071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1172396Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1173219Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1174476Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1175723Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1176993Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1178250Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1179512Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1180752Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1182007Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1183244Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1184497Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1185259Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.1185462Z Autotune Choices Stats: 2025-12-04T09:58:54.1186333Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.1187355Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1187781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1188254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1189207Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1190495Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1191776Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1193060Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1194350Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1195652Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1196985Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1198260Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1199553Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1200835Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1201623Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.1201861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1202017Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1202125Z unimplemented [] 2025-12-04T09:58:54.1202241Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1202454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1203160Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1203802Z graph_break [] 2025-12-04T09:58:54.1203931Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1204081Z Autotune Choices Stats: 2025-12-04T09:58:54.1204903Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.1205826Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1206130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1206440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1207254Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1208490Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1209726Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1210981Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1212216Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1213461Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1214722Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1215981Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1217221Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1218462Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1219232Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.1219434Z Autotune Choices Stats: 2025-12-04T09:58:54.1220263Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.1221259Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1221688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1222175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1223139Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1224409Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1225689Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1227002Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1228281Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1229569Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1230877Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1232180Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1233589Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1234869Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1235650Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.1235908Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.1236119Z Traceback (most recent call last): 2025-12-04T09:58:54.1236350Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.1236569Z self.assertTrue( 2025-12-04T09:58:54.1236734Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.1236922Z raise self.failureException(msg) 2025-12-04T09:58:54.1237130Z AssertionError: False is not true : Log file /tmp/tmpjnhi31tc/flex_attention_configs.json was not created 2025-12-04T09:58:54.1237291Z 2025-12-04T09:58:54.1237367Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.1237669Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.1237869Z 2025-12-04T09:58:54.1237960Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.1238159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1238311Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1238419Z unimplemented [] 2025-12-04T09:58:54.1238539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1239224Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.1239947Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1240134Z graph_break [] 2025-12-04T09:58:54.1240260Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1240854Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.1241418Z current_size = base.storage().size() 2025-12-04T09:58:54.1241537Z Autotune Choices Stats: 2025-12-04T09:58:54.1242347Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.1243240Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1243512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1243826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1244628Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1245872Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1247145Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1248391Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1249749Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1251000Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1252227Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1253458Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1254705Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1255965Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1256734Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.1256936Z Autotune Choices Stats: 2025-12-04T09:58:54.1257765Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.1258787Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1259200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1259674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1260619Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1261884Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1263183Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1264455Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1265743Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1267095Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1268399Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1269685Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1270970Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1272268Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1273056Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.1273302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1273459Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1273570Z unimplemented [] 2025-12-04T09:58:54.1273693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1273889Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1274611Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1275262Z graph_break [] 2025-12-04T09:58:54.1275409Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.1275562Z Autotune Choices Stats: 2025-12-04T09:58:54.1276400Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.1277297Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1277574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1277892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1278703Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1279942Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1281199Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1282433Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1283680Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1284946Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1286222Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1287463Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1288702Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1289957Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1290724Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.1290935Z Autotune Choices Stats: 2025-12-04T09:58:54.1291759Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1292774Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1293206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1293683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1294620Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1295901Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1297210Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1298504Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1299783Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1301085Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1302410Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1303685Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1304962Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1306263Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1307054Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.1307296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1307450Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1307563Z unimplemented [] 2025-12-04T09:58:54.1307683Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1307899Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1308605Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1309247Z graph_break [] 2025-12-04T09:58:54.1309378Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1309535Z Autotune Choices Stats: 2025-12-04T09:58:54.1310349Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.1311279Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1311556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1311868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1312671Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1313918Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1315166Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1316443Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1317678Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.1318950Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1320221Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1321461Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1322687Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1323929Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1324698Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.1324904Z Autotune Choices Stats: 2025-12-04T09:58:54.1325731Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.1326770Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.1327202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1327695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1328655Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1329935Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1331218Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1332501Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1333796Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1335090Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1336434Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1337743Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1339038Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1340328Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1341118Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.1341359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1341517Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1341626Z unimplemented [] 2025-12-04T09:58:54.1341751Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1341946Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1342682Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1343324Z graph_break [] 2025-12-04T09:58:54.1343455Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1343609Z Autotune Choices Stats: 2025-12-04T09:58:54.1344424Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.1345326Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1345603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1345957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1346768Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1348016Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1349260Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1350497Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1351754Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1352995Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1354247Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1355515Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1356779Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1358020Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1358787Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.1358996Z Autotune Choices Stats: 2025-12-04T09:58:54.1359824Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.1360839Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1361253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1361729Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1362702Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1364025Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1365310Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1366625Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1367910Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1369219Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1370506Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1371810Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1373126Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1374415Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1375202Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.1375445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1375604Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1375717Z unimplemented [] 2025-12-04T09:58:54.1375837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1376068Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1376784Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1377425Z graph_break [] 2025-12-04T09:58:54.1377556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1377708Z Autotune Choices Stats: 2025-12-04T09:58:54.1378521Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.1379416Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1379693Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1380023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1380848Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1382101Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.1383344Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1384587Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1385829Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1387119Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1388387Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1389659Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1390920Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1392158Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1392935Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.1393141Z Autotune Choices Stats: 2025-12-04T09:58:54.1393968Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.1394972Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1395395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1395887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1396867Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1398180Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1399500Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1400780Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1402076Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1403364Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1404665Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1405990Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1407289Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1408606Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1409391Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.1409630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1409789Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1409901Z unimplemented [] 2025-12-04T09:58:54.1410020Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1410212Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1410921Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1411564Z graph_break [] 2025-12-04T09:58:54.1411693Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1411847Z Autotune Choices Stats: 2025-12-04T09:58:54.1412641Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.1413546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1413830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1414146Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1414967Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1416262Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1417515Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1418761Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1420007Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1421248Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1422503Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1423757Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1425005Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1426275Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1427042Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.1427248Z Autotune Choices Stats: 2025-12-04T09:58:54.1428067Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.1429062Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1429475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1429955Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1430913Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1432202Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1433499Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1434819Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1436144Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1437435Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1438724Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1440030Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1441335Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1442625Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1443426Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.1443666Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1443820Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1443928Z unimplemented [] 2025-12-04T09:58:54.1444045Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1444239Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1444955Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1445596Z graph_break [] 2025-12-04T09:58:54.1445728Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1445768Z Autotune Choices Stats: 2025-12-04T09:58:54.1446540Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.1446672Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1446788Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1446952Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1447574Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1448196Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1448811Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1449426Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1450031Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1450636Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1451245Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1451855Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1452467Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1453078Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1453222Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.1453262Z Autotune Choices Stats: 2025-12-04T09:58:54.1454018Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.1454234Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1454403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1454681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1455309Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1455988Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1456630Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1457270Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1457911Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1458535Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1459160Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1459784Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1460418Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1461060Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1461201Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.1461275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1461331Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1461368Z unimplemented [] 2025-12-04T09:58:54.1461430Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1461530Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1462105Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1462141Z graph_break [] 2025-12-04T09:58:54.1462219Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1462260Z Autotune Choices Stats: 2025-12-04T09:58:54.1463002Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.1463132Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1463245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1463406Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1464025Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1464635Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1465249Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1465860Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1466510Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1467109Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1467709Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1468310Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1468924Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1469542Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1469685Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.1469725Z Autotune Choices Stats: 2025-12-04T09:58:54.1470492Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.1470712Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1470875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1471153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.1471782Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1472416Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1473061Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1473694Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1474331Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1474971Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1475593Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1476252Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1476877Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1477722Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1477854Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.1477928Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1477973Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1478011Z unimplemented [] 2025-12-04T09:58:54.1478094Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1478209Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1478791Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1478849Z graph_break [] 2025-12-04T09:58:54.1478924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1478967Z Autotune Choices Stats: 2025-12-04T09:58:54.1479702Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.1479832Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1479947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1480111Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1480725Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1481343Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1481945Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1482557Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1483173Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1483784Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1484389Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1484995Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1485598Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1486245Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1486375Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.1486416Z Autotune Choices Stats: 2025-12-04T09:58:54.1487197Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.1487445Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1487610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1487891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1488522Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1489152Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1489774Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1490415Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1491056Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1491689Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1492335Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1492956Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1493584Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1494213Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1494343Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.1494418Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1494464Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1494501Z unimplemented [] 2025-12-04T09:58:54.1494565Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1494666Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1495255Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1495306Z graph_break [] 2025-12-04T09:58:54.1495381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1495435Z Autotune Choices Stats: 2025-12-04T09:58:54.1496211Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.1496342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.1496457Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1496618Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1497234Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1497834Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1498461Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1499067Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1499681Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1500306Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1500908Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1501516Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1502117Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1502732Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1502863Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.1502905Z Autotune Choices Stats: 2025-12-04T09:58:54.1503681Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.1503901Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1504081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1504370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1505007Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1505630Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1506292Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1506937Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1507569Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1508207Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1508840Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1509483Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1510109Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1510730Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1510860Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.1510937Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.1510979Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1511017Z unimplemented [] 2025-12-04T09:58:54.1511079Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1511263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1511836Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1511876Z graph_break [] 2025-12-04T09:58:54.1511952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1511999Z Autotune Choices Stats: 2025-12-04T09:58:54.1512752Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.1512902Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1513018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1513179Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1513792Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1514402Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1515003Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.1515628Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1516325Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1516940Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1517570Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1518173Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1518780Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1519380Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1519514Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.1519556Z Autotune Choices Stats: 2025-12-04T09:58:54.1520323Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.1520542Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1520725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1521002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1521649Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1522284Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1522904Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1523529Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1524173Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1524799Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1525429Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1526122Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1526745Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1527373Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1527502Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.1527580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1527624Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.1527663Z unimplemented [] 2025-12-04T09:58:54.1527723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1527826Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1528419Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1528459Z graph_break [] 2025-12-04T09:58:54.1528533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1528577Z Autotune Choices Stats: 2025-12-04T09:58:54.1529326Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.1529465Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1529584Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1529759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1530373Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1530978Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1531585Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1532185Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1532800Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1533406Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1534023Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1534646Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1535248Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1535858Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1536021Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.1536063Z Autotune Choices Stats: 2025-12-04T09:58:54.1536825Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.1537060Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1537226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1537506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1538155Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1538808Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1539428Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1540055Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1540682Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1541315Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1541939Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1542574Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1543216Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1543841Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1543970Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.1544047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1544089Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1544132Z unimplemented [] 
2025-12-04T09:58:54.1544192Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1544298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1544871Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1544914Z graph_break [] 2025-12-04T09:58:54.1544991Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1545031Z Autotune Choices Stats: 2025-12-04T09:58:54.1545785Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.1545914Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1546063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1546222Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1546845Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1547469Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1548071Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1548675Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1549278Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1549900Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1550502Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1551114Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1551742Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1552345Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1552476Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.1552517Z Autotune Choices Stats: 2025-12-04T09:58:54.1553270Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1553489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1553657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1553950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1554577Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1555219Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1555869Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1556526Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1557158Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1557783Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1558433Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1559060Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1559700Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1560354Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1560486Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.1560560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1560606Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1560643Z unimplemented [] 2025-12-04T09:58:54.1560706Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.1560806Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1561380Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1561416Z graph_break [] 2025-12-04T09:58:54.1561492Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1561534Z Autotune Choices Stats: 2025-12-04T09:58:54.1562284Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.1562426Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1562541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1562705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1563322Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1563942Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1564556Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1565154Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1565760Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1566394Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1567003Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1567603Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1568223Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1568857Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1568989Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.1569032Z Autotune Choices Stats: 
2025-12-04T09:58:54.1569785Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.1570003Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1570170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1570448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1571092Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1571716Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1572353Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1572995Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1573620Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1574249Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1574873Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1575509Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1576176Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1576814Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1576971Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.1577044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1577088Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1577125Z unimplemented [] 2025-12-04T09:58:54.1577187Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1577288Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1577863Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1577901Z graph_break [] 2025-12-04T09:58:54.1577978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1578017Z Autotune Choices Stats: 2025-12-04T09:58:54.1578756Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.1578885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1578998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1579159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.1579779Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1580398Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1581009Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1581622Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1582229Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1582838Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1583441Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1584065Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1584685Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1585294Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1585433Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.1585473Z Autotune Choices Stats: 2025-12-04T09:58:54.1586279Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1586506Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1586671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1586951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1587581Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1588231Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1588854Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1589487Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1590143Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1590768Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1591390Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1592022Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1592659Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1593292Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1593434Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.1593509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1593562Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1593599Z unimplemented [] 2025-12-04T09:58:54.1593662Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1593761Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1594337Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1594375Z graph_break [] 2025-12-04T09:58:54.1594447Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1594489Z Autotune Choices Stats: 2025-12-04T09:58:54.1595234Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.1595366Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1595481Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1595644Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1596303Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1596927Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1597559Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1598176Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1598799Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1599404Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1600007Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1600611Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1601232Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1601842Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1601987Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.1602028Z Autotune Choices Stats: 2025-12-04T09:58:54.1602799Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1603020Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1603185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1603463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1604093Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1604721Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1605355Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1606025Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1606669Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1607307Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1607926Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1608560Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1609183Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1609822Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1609954Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.1610028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1610072Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1610111Z unimplemented [] 2025-12-04T09:58:54.1610173Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1610282Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1610865Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1610919Z graph_break [] 2025-12-04T09:58:54.1610993Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1611036Z Autotune Choices Stats: 2025-12-04T09:58:54.1611774Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.1611905Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1612025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1612187Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1612803Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1613416Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1614022Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1614649Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1615260Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1615883Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1616524Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1617126Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1617730Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1618344Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1618476Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.1618520Z Autotune Choices Stats: 2025-12-04T09:58:54.1619282Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.1619526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1619689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1619967Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1620608Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1621230Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1621853Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1622490Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1623127Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1623758Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1624396Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1625019Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1625644Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1626304Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1626457Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.1626535Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1626577Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1626615Z unimplemented [] 2025-12-04T09:58:54.1626676Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1626778Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1627364Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1627415Z graph_break [] 2025-12-04T09:58:54.1627488Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1627530Z Autotune Choices Stats: 2025-12-04T09:58:54.1628285Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.1628413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1628530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1628694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1629310Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1629914Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1630537Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1631141Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1631753Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1632381Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1632993Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1633599Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1634200Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1634810Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1634946Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.1634990Z Autotune Choices Stats: 2025-12-04T09:58:54.1635744Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1636007Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1636185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1636480Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1637111Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1637738Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1638361Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1638987Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1639627Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1640264Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1640901Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1641544Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.1642170Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1642799Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1642927Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.1643001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1643043Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1643082Z unimplemented [] 2025-12-04T09:58:54.1643145Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1643259Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1643830Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1643870Z graph_break [] 2025-12-04T09:58:54.1643943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1643984Z Autotune Choices Stats: 2025-12-04T09:58:54.1644743Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.1644891Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1645006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1645167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1645778Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1646417Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1647023Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1647646Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.1648249Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1648883Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1649517Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1650124Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1650727Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1651331Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1651462Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.1651506Z Autotune Choices Stats: 2025-12-04T09:58:54.1652272Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1652490Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1652663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1652951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1653591Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1654229Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1654854Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1655476Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1656154Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1656777Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1657416Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1658065Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1658689Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1659318Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1659446Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.1659523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1661619Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1661662Z unimplemented [] 2025-12-04T09:58:54.1661727Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1661830Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1662437Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1662478Z graph_break [] 2025-12-04T09:58:54.1662555Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1662594Z Autotune Choices Stats: 2025-12-04T09:58:54.1663337Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.1663482Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1663610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1663781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1664385Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1664993Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1665597Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1666226Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1666856Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1667462Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1668078Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1668704Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1669306Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1669904Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1670033Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.1670074Z Autotune Choices Stats: 2025-12-04T09:58:54.1670830Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.1671058Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1671231Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1671508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1672148Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1672788Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1673418Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1674042Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1674669Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1675308Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1675973Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1676613Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1677261Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1677885Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1678014Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.1678095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1678139Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1678177Z unimplemented [] 2025-12-04T09:58:54.1678238Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1678340Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1678919Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1678957Z graph_break [] 2025-12-04T09:58:54.1679031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1679072Z Autotune Choices Stats: 2025-12-04T09:58:54.1679828Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.1679958Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1680071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1680233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1680852Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1681482Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1682082Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1682681Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1683284Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1683895Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1684496Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1685105Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1685726Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1686378Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1686508Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.1686549Z Autotune Choices Stats: 2025-12-04T09:58:54.1687305Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.1687523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1687687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1687980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1688606Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1689252Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1689896Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1690516Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1691142Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1691766Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1692402Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1693021Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1693659Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1694293Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1694424Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.1694498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1694543Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1694580Z unimplemented [] 2025-12-04T09:58:54.1694640Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1694737Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1695325Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1695364Z graph_break [] 2025-12-04T09:58:54.1695439Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1695479Z Autotune Choices Stats: 2025-12-04T09:58:54.1696258Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.1696404Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1696519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1696680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1697314Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1697915Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1698553Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1699154Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1699761Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1700363Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1700976Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1701586Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1702196Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1702809Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1702940Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.1702980Z Autotune Choices Stats: 2025-12-04T09:58:54.1703746Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.1703965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1704127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1704404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1705044Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1705676Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1706349Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1706987Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1707620Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1708247Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1708870Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1709509Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1710137Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.1710773Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.1710920Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices
2025-12-04T09:58:54.1711012Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.1711060Z Traceback (most recent call last):
2025-12-04T09:58:54.1711212Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.1711253Z self.assertTrue(
2025-12-04T09:58:54.1711359Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.1711409Z raise self.failureException(msg)
2025-12-04T09:58:54.1711538Z AssertionError: False is not true : Log file /tmp/tmp3pg9sr7g/flex_attention_configs.json was not created
2025-12-04T09:58:54.1711542Z
2025-12-04T09:58:54.1711620Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.1711784Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.1711788Z
2025-12-04T09:58:54.1711877Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.1711952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1711993Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1712031Z unimplemented [] 2025-12-04T09:58:54.1712092Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1712674Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.1712773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1712810Z graph_break [] 2025-12-04T09:58:54.1712883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1713388Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.1713437Z current_size = base.storage().size() 2025-12-04T09:58:54.1713479Z Autotune Choices Stats: 2025-12-04T09:58:54.1714242Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.1714381Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1714495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1714669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1715280Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1715885Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1716590Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1717195Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1717819Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1718419Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1719031Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1719654Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1720252Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1720854Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1720985Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.1721025Z Autotune Choices Stats: 2025-12-04T09:58:54.1721777Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.1722004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1722170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1722446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1723093Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1723736Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1724354Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1724974Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1725593Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1726269Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1726888Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1727529Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1728176Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1728798Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1728926Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.1729090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1729132Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1729170Z unimplemented [] 2025-12-04T09:58:54.1729232Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1729335Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1729907Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1729948Z graph_break [] 2025-12-04T09:58:54.1730023Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.1730062Z Autotune Choices Stats: 2025-12-04T09:58:54.1730813Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.1730942Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1731056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1731216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1731830Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1732454Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1733059Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1733658Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1734259Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1734866Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1735469Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1736117Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1736740Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1737344Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1737473Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.1737514Z Autotune Choices Stats: 2025-12-04T09:58:54.1738269Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1738486Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1738650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1738943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1739572Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1740205Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1740845Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1741469Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1742098Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1742722Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1743366Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1743989Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1744614Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1745262Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1745394Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.1745469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1745510Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1745550Z unimplemented [] 2025-12-04T09:58:54.1745611Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1745711Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1746325Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1746364Z graph_break [] 2025-12-04T09:58:54.1746438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1746477Z Autotune Choices Stats: 2025-12-04T09:58:54.1747227Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.1747370Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1747485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1747719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1748324Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1748939Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1749569Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1750167Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1750776Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.1751387Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1752005Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1752604Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1753214Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1753839Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1753968Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.1754008Z Autotune Choices Stats: 2025-12-04T09:58:54.1754766Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.1754982Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.1755145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1755425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1756106Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1756735Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1757374Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1758021Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1758647Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1759273Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1759896Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1760532Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1761156Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1761791Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1761944Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.1762018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1762060Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1762097Z unimplemented [] 2025-12-04T09:58:54.1762157Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1762257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1762836Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1762873Z graph_break [] 2025-12-04T09:58:54.1762946Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1762985Z Autotune Choices Stats: 2025-12-04T09:58:54.1763742Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.1763872Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1763983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1764143Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1764765Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1765381Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1766036Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1766655Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1767254Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1767861Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1768459Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1769076Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1769673Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1770288Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1770439Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.1770478Z Autotune Choices Stats: 2025-12-04T09:58:54.1771236Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.1771452Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1771619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1771898Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1772526Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1773161Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1773788Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1774418Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1775063Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1775691Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1776350Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1776969Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1777616Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1778242Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1778391Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.1778475Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1778517Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1778567Z unimplemented [] 2025-12-04T09:58:54.1778629Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1778728Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1779298Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1779337Z graph_break [] 2025-12-04T09:58:54.1779410Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1779449Z Autotune Choices Stats: 2025-12-04T09:58:54.1780184Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.1780314Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1780428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1780587Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1781195Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1781813Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.1782422Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1783035Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1783648Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1784248Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1784854Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1785459Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1786115Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1786738Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1786881Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.1786921Z Autotune Choices Stats: 2025-12-04T09:58:54.1787682Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.1787914Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1788077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1788356Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1788991Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1789613Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1790243Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1790870Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1791511Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1792161Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1792786Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1793413Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1794036Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1794677Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1794806Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.1794880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1794923Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1794959Z unimplemented [] 2025-12-04T09:58:54.1795020Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1795130Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1795710Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1795759Z graph_break [] 2025-12-04T09:58:54.1795833Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1795875Z Autotune Choices Stats: 2025-12-04T09:58:54.1796650Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.1796781Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1796897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1797058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1797668Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1798270Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1798893Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1799509Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1800123Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1800737Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1801345Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1801951Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1802555Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1803168Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1803301Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.1803342Z Autotune Choices Stats: 2025-12-04T09:58:54.1804107Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.1804342Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1804506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1804785Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1805424Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1806079Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1806703Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1807344Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1807982Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1808616Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1809260Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1809887Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1810513Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1811133Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1811274Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.1811353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1811395Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1811435Z unimplemented [] 2025-12-04T09:58:54.1811496Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1811597Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1812179Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1812229Z graph_break [] 2025-12-04T09:58:54.1812301Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1812343Z Autotune Choices Stats: 2025-12-04T09:58:54.1813075Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.1813215Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1813333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1813492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1814108Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1814708Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1815326Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1815960Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1816581Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1817191Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1817810Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1818415Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1819017Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1819623Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1819764Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.1819806Z Autotune Choices Stats: 2025-12-04T09:58:54.1820556Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.1820783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1820956Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1821246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1821879Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1822500Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1823124Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1823751Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1824385Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1825021Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1825658Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1826328Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1826953Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1827578Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1827706Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.1827781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1827822Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1827860Z unimplemented [] 2025-12-04T09:58:54.1827921Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1828023Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1828610Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1828651Z graph_break [] 2025-12-04T09:58:54.1828724Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1828764Z Autotune Choices Stats: 2025-12-04T09:58:54.1829528Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.1829679Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1829795Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1829954Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1830569Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1831176Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1831779Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1832394Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1832990Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1833610Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1834232Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1834835Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1835438Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1836086Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1836216Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.1836258Z Autotune Choices Stats: 2025-12-04T09:58:54.1837029Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.1837249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1837417Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1837705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.1838347Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1838978Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1839603Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1840226Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1840853Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1841490Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1842130Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1842764Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1843398Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1844022Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1844150Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.1844226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1844267Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1844306Z unimplemented [] 2025-12-04T09:58:54.1844366Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1844466Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1845039Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1845078Z graph_break [] 2025-12-04T09:58:54.1845165Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1845204Z Autotune Choices Stats: 2025-12-04T09:58:54.1845993Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.1846139Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1846271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1846431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1847065Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1847672Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1848274Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1848880Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1849500Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1850102Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1850718Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1851338Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1851942Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1852545Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1852673Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.1852714Z Autotune Choices Stats: 2025-12-04T09:58:54.1853472Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.1853698Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1853863Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1854137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1854775Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1855411Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1856100Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1856722Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1857351Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1858006Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1858634Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1859270Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1859919Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1860544Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1860672Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.1860752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1860793Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1860831Z unimplemented [] 2025-12-04T09:58:54.1860890Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1860989Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1861573Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1861611Z graph_break [] 2025-12-04T09:58:54.1861683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1861724Z Autotune Choices Stats: 2025-12-04T09:58:54.1862469Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.1862598Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.1862713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1862874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1863502Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1864122Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1864727Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1865334Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1865974Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1866592Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1867194Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1867814Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1868439Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1869043Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1869172Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.1869211Z Autotune Choices Stats: 2025-12-04T09:58:54.1869980Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.1870195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1870362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1870656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1871277Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1871915Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1872558Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1873178Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1873811Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1874435Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1875068Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.1875700Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1876378Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1877026Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1877159Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.1877232Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.1877277Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1877314Z unimplemented [] 2025-12-04T09:58:54.1877374Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1877472Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1878047Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1878083Z graph_break [] 2025-12-04T09:58:54.1878158Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1878197Z Autotune Choices Stats: 2025-12-04T09:58:54.1878936Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.1879066Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1879191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1879354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1879965Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1880580Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1881202Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.1881812Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1882411Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1883006Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1883623Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1884226Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1884837Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1885458Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1885587Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.1885627Z Autotune Choices Stats: 2025-12-04T09:58:54.1886420Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.1886636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1886801Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1887078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1887723Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1888347Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1888984Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1889643Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1890268Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1890897Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1891516Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1892167Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1892792Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1893422Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1893567Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.1893641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1893687Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.1893724Z unimplemented [] 2025-12-04T09:58:54.1893786Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1893884Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1894460Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1894499Z graph_break [] 2025-12-04T09:58:54.1894574Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1894614Z Autotune Choices Stats: 2025-12-04T09:58:54.1895357Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.1895488Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1895601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1895764Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1896424Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1897036Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1897642Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1898268Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1898865Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1899468Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1900074Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1900689Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1901287Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1901902Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1902054Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.1902093Z Autotune Choices Stats: 2025-12-04T09:58:54.1902848Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.1903065Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1903228Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1903506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1904143Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1904783Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1905403Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1906087Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1906736Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1907362Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1907994Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1908622Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1909259Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1909884Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1910024Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.1910107Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1910149Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1910186Z unimplemented [] 
2025-12-04T09:58:54.1910259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1910357Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1910932Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.1910969Z graph_break [] 2025-12-04T09:58:54.1911046Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1911089Z Autotune Choices Stats: 2025-12-04T09:58:54.1911832Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.1911962Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1912077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1912238Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1912847Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1913458Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1914072Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1914684Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1915295Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1915902Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1916538Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1917138Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1917765Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1918367Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1918512Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.1918577Z Autotune Choices Stats: 2025-12-04T09:58:54.1919330Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1919561Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1919728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1920012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1920644Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1921268Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1921917Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1922538Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1923175Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1923830Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1924454Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1925083Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1925707Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1926375Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1926506Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.1926583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1926625Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1926664Z unimplemented [] 2025-12-04T09:58:54.1926725Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.1926828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1927415Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1927480Z graph_break [] 2025-12-04T09:58:54.1927554Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1927595Z Autotune Choices Stats: 2025-12-04T09:58:54.1928338Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.1928467Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1928581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1928742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1929355Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1929963Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1930572Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1931185Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1931802Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1932417Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1933016Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1933620Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1934229Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1934837Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1934969Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.1935012Z Autotune Choices Stats: 
2025-12-04T09:58:54.1935778Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.1936057Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1936223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1936503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1937139Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1937762Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1938381Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1939019Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1939657Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1940298Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1940930Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1941550Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1942179Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1942801Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1942928Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.1943018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1943061Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1943100Z unimplemented [] 2025-12-04T09:58:54.1943161Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1943264Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1943848Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1943887Z graph_break [] 2025-12-04T09:58:54.1943973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1944015Z Autotune Choices Stats: 2025-12-04T09:58:54.1944747Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.1944885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1945001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1945160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.1945770Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1946416Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1947021Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1947640Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1948268Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1948880Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1949501Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1950107Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1950711Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1951315Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1951455Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.1951498Z Autotune Choices Stats: 2025-12-04T09:58:54.1952254Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.1952483Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1952667Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1952950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1953575Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.1954204Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1954825Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1955447Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1956127Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1956766Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1957394Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1958031Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1958662Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1959283Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1959411Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.1959488Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1959530Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1959569Z unimplemented [] 2025-12-04T09:58:54.1959630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1959731Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1960321Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1960362Z graph_break [] 2025-12-04T09:58:54.1960437Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1960478Z Autotune Choices Stats: 2025-12-04T09:58:54.1961232Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.1961379Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1961494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1961652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1962272Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1962876Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1963480Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1964094Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1964702Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1965319Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1965960Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1966580Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1967179Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1967782Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1967912Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.1967953Z Autotune Choices Stats: 2025-12-04T09:58:54.1968727Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.1968944Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1969110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1969408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1970048Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1970684Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1971309Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1971931Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1972561Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1973194Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1973827Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1974457Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1975093Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1975715Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1975843Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.1975958Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1976003Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1976041Z unimplemented [] 2025-12-04T09:58:54.1976101Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1976202Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1976778Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1976817Z graph_break [] 2025-12-04T09:58:54.1976910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1976951Z Autotune Choices Stats: 2025-12-04T09:58:54.1977688Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.1977828Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1977952Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1978114Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1978738Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1979341Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1979948Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1980557Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1981176Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1981781Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1982398Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1983018Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1983614Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1984218Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1984350Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.1984391Z Autotune Choices Stats: 2025-12-04T09:58:54.1985148Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.1985365Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1985548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1985823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1986523Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1987159Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1987792Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1988413Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1989045Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1989686Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1990313Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1990948Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1991591Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1992215Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1992346Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.1992420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.1992465Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.1992503Z unimplemented [] 2025-12-04T09:58:54.1992565Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.1992664Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.1993243Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.1993279Z graph_break [] 2025-12-04T09:58:54.1993354Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.1993394Z Autotune Choices Stats: 2025-12-04T09:58:54.1994143Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.1994275Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.1994395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.1994560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.1995179Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1995796Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1996430Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.1997040Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1997639Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1998261Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1998875Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.1999489Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2000116Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2000721Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2000858Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.2000897Z Autotune Choices Stats: 2025-12-04T09:58:54.2001655Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2001871Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2002032Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2002321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2002955Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2003591Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2004233Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2004855Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2005481Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2006137Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2006773Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2007396Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.2008029Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2008675Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2008807Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.2008883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2008930Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2008967Z unimplemented [] 2025-12-04T09:58:54.2009029Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2009127Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2009702Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2009743Z graph_break [] 2025-12-04T09:58:54.2009819Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2009861Z Autotune Choices Stats: 2025-12-04T09:58:54.2010599Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.2010732Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2010853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2011019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2011631Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2012241Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2012868Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2013467Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.2014070Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2014673Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2015285Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2015885Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2016507Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2017148Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2017287Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.2017329Z Autotune Choices Stats: 2025-12-04T09:58:54.2018083Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2018303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2018469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2018750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2019395Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2020017Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2020661Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2021304Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2021930Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2022556Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2023185Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2023823Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2024451Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2025084Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2025232Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.2025307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2025350Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2025389Z unimplemented [] 2025-12-04T09:58:54.2025452Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2025550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2026154Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2026196Z graph_break [] 2025-12-04T09:58:54.2026270Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2026311Z Autotune Choices Stats: 2025-12-04T09:58:54.2027058Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.2027187Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2027301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2027464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2028090Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2028690Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2029306Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2029929Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2030535Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2031142Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2031738Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2032351Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2032951Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2033569Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2033716Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.2033757Z Autotune Choices Stats: 2025-12-04T09:58:54.2034518Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.2034739Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2034904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2035185Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2035822Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2036491Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2037119Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2037759Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2038416Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2039051Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2039677Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2040307Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2040953Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2041575Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2041717Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.2041808Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2041853Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2041894Z unimplemented [] 2025-12-04T09:58:54.2041955Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2042071Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2042642Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2042682Z graph_break [] 2025-12-04T09:58:54.2042758Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2042800Z Autotune Choices Stats: 2025-12-04T09:58:54.2043540Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.2043669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2043784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2043945Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2044556Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2045172Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2045784Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2046437Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2047058Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2047659Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2048260Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2048868Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2049488Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2050089Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2050234Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.2050292Z Autotune Choices Stats: 2025-12-04T09:58:54.2051043Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.2051273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2051441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2051719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2052353Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2052980Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2053620Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2054246Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2054884Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2055528Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2056195Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2056827Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2057455Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2058102Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2058231Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.2058309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2058351Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2058392Z unimplemented [] 2025-12-04T09:58:54.2058454Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2058555Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2059145Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2059212Z graph_break [] 2025-12-04T09:58:54.2059286Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2059331Z Autotune Choices Stats: 2025-12-04T09:58:54.2060071Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.2060201Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2060318Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2060479Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2061095Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2061700Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2062311Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2062929Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2063539Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2064166Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2064773Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2065382Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2066024Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2066643Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2066774Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.2066818Z Autotune Choices Stats: 2025-12-04T09:58:54.2067591Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.2067833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2068001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2068280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2068921Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2069548Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2070169Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2070806Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2071437Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2072076Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2072718Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2073355Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2073979Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2074602Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2074732Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.2074820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2074863Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2074905Z unimplemented [] 2025-12-04T09:58:54.2074967Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2075069Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2075641Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2075692Z graph_break [] 
2025-12-04T09:58:54.2075770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2075821Z Autotune Choices Stats: 2025-12-04T09:58:54.2076603Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.2076745Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2076865Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2077023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2077637Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2078244Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2078853Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2079468Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2080080Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2080701Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2081326Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2081926Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2082532Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2083140Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2083268Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.2083320Z Autotune Choices Stats: 2025-12-04T09:58:54.2084083Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.2084309Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2084483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2084767Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2085408Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2086070Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.2086697Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2087329Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2087971Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2088614Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2089252Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2089900Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2090522Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.2091155Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.2091286Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices
2025-12-04T09:58:54.2091379Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.2091429Z Traceback (most recent call last):
2025-12-04T09:58:54.2091581Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.2091623Z self.assertTrue(
2025-12-04T09:58:54.2091729Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.2091793Z raise self.failureException(msg)
2025-12-04T09:58:54.2091919Z AssertionError: False is not true : Log file /tmp/tmp0frn1eqy/flex_attention_configs.json was not created
2025-12-04T09:58:54.2091922Z
2025-12-04T09:58:54.2091998Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.2092163Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.2092166Z
2025-12-04T09:58:54.2092257Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.2092333Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.2092381Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.2092417Z unimplemented []
2025-12-04T09:58:54.2092480Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.2093067Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:54.2093190Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:54.2093228Z graph_break []
2025-12-04T09:58:54.2093302Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:54.2093798Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.2093848Z current_size = base.storage().size() 2025-12-04T09:58:54.2093899Z Autotune Choices Stats: 2025-12-04T09:58:54.2094645Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.2094776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2094892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2095055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2095664Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2096318Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2096939Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2097548Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2098165Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2098763Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2099365Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2099960Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2100569Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2101164Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2101307Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.2101357Z Autotune Choices Stats: 2025-12-04T09:58:54.2102113Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.2102417Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2102589Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2102868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2103500Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2104119Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2104755Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2105377Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2106050Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2106699Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2107323Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2107947Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2108570Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2109208Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2109338Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.2109415Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2109458Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2109499Z unimplemented [] 2025-12-04T09:58:54.2109561Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2109663Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2110244Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2110304Z graph_break [] 2025-12-04T09:58:54.2110377Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.2110418Z Autotune Choices Stats: 2025-12-04T09:58:54.2111155Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.2111285Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2111398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2111559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2112171Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2112781Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2113396Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2114007Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2114617Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2115229Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2115831Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2116478Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2117082Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2117702Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2117835Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.2117875Z Autotune Choices Stats: 2025-12-04T09:58:54.2118643Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2118888Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2119055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2119333Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2119968Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2120587Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2121213Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2121849Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2122484Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2123114Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2123755Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2124379Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2125008Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2125632Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2125761Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.2125849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2125890Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2125978Z unimplemented [] 2025-12-04T09:58:54.2126039Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2126140Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2126713Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2126773Z graph_break [] 2025-12-04T09:58:54.2126861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2126904Z Autotune Choices Stats: 2025-12-04T09:58:54.2127642Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.2127788Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2127905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2128065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2128678Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2129286Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2129883Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2130498Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2131123Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.2131729Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2132341Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2132940Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2133545Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2134147Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2134275Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.2134332Z Autotune Choices Stats: 2025-12-04T09:58:54.2135084Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.2135314Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.2135491Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2135778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2136445Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2137062Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2137684Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2138307Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2138948Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2139590Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2140221Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2140861Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2141495Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2142118Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2142246Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.2142321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2142362Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2142399Z unimplemented [] 2025-12-04T09:58:54.2142459Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2142558Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2143149Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2143189Z graph_break [] 2025-12-04T09:58:54.2143263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2143302Z Autotune Choices Stats: 2025-12-04T09:58:54.2144053Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.2144204Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2144318Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2144477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2145092Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2145695Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2146337Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2146935Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2147562Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2148176Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2148784Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2149402Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2150009Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2150611Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2150740Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.2150781Z Autotune Choices Stats: 2025-12-04T09:58:54.2151550Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.2151765Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2151928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2152211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2152856Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2153490Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2154112Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2154738Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2155366Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2156040Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2156674Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2157313Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2157956Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2158577Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2158705Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.2158781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2158822Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2158859Z unimplemented [] 2025-12-04T09:58:54.2158919Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2159019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2159595Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2159632Z graph_break [] 2025-12-04T09:58:54.2159720Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2159762Z Autotune Choices Stats: 2025-12-04T09:58:54.2160500Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.2160627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2160763Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2160935Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2163058Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2163668Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.2164278Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2164878Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2165498Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2166142Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2166771Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2167385Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2168004Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2168609Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2168738Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.2168780Z Autotune Choices Stats: 2025-12-04T09:58:54.2169531Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.2169749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2169937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2170214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2170849Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2171477Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2172118Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2172737Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2173363Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2173991Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2174630Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2175269Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2175907Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2176569Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2176698Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.2176773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2176817Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2176855Z unimplemented [] 2025-12-04T09:58:54.2176916Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2177014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2177589Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2177626Z graph_break [] 2025-12-04T09:58:54.2177702Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2177742Z Autotune Choices Stats: 2025-12-04T09:58:54.2178500Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.2178631Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2178746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2178907Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2179524Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2180155Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2180758Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2181365Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2181973Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2182587Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2183192Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2183805Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2184425Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2185025Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2185156Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.2185196Z Autotune Choices Stats: 2025-12-04T09:58:54.2185978Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.2186195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2186358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2186640Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2187285Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2187927Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2188565Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2189203Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2189831Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2190453Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2191086Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2191716Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2192351Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2193002Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2193134Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.2193211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2193254Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2193291Z unimplemented [] 2025-12-04T09:58:54.2193352Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2193452Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2194026Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2194063Z graph_break [] 2025-12-04T09:58:54.2194140Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2194181Z Autotune Choices Stats: 2025-12-04T09:58:54.2194924Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.2195052Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2195177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2195340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2195996Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2196617Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2197247Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2199443Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2200671Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2201283Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2201913Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2202511Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2203125Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2203717Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2203848Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.2203937Z Autotune Choices Stats: 2025-12-04T09:58:54.2204691Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.2204924Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2205089Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2205365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2206056Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2206677Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2207317Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2207938Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2208564Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2209211Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2209851Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2210499Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2211121Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2211752Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2211883Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.2211960Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2212003Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2212041Z unimplemented [] 2025-12-04T09:58:54.2212102Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2212203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2212778Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2212826Z graph_break [] 2025-12-04T09:58:54.2212899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2212951Z Autotune Choices Stats: 2025-12-04T09:58:54.2213689Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.2213818Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2213934Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2214095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2214717Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2215316Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2215968Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2216575Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2217187Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2217797Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2218402Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2219015Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2219612Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2220228Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2220359Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.2220400Z Autotune Choices Stats: 2025-12-04T09:58:54.2221155Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.2221382Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2221547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2221836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.2222470Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2223103Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2223722Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2224357Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2224981Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2225605Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2226286Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2226928Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2227564Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2228188Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2228328Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.2228404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2228446Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2228483Z unimplemented [] 2025-12-04T09:58:54.2228543Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2228643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2229215Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2229252Z graph_break [] 2025-12-04T09:58:54.2229328Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2229384Z Autotune Choices Stats: 2025-12-04T09:58:54.2230122Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.2230261Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2230377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2230537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2231145Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2231759Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2232362Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2232972Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2233577Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2234194Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2234805Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2235403Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2236065Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2236668Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2236808Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.2236850Z Autotune Choices Stats: 2025-12-04T09:58:54.2237599Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.2237816Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2237996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2238275Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2238905Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2239538Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2240172Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2240796Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2241430Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2242056Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2242706Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2243337Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2243960Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2244600Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2244729Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.2244803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2244845Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2244881Z unimplemented [] 2025-12-04T09:58:54.2244941Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2245040Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2245624Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2245663Z graph_break [] 2025-12-04T09:58:54.2245737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2245777Z Autotune Choices Stats: 2025-12-04T09:58:54.2246545Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.2246685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.2246800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2246974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2247579Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2248182Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2248803Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2249416Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2250016Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2250621Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2251230Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2251841Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2252445Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2253058Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2253185Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.2253226Z Autotune Choices Stats: 2025-12-04T09:58:54.2253989Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.2254204Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2254371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2254649Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2255288Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2255957Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2256589Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2257227Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2257852Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2258504Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2259122Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2259760Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2260399Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2261033Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2261161Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.2261248Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.2261290Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2261327Z unimplemented [] 2025-12-04T09:58:54.2261387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2261486Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2262063Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2262100Z graph_break [] 2025-12-04T09:58:54.2262183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2262225Z Autotune Choices Stats: 2025-12-04T09:58:54.2262963Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.2263090Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2263205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2263378Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2263979Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2264593Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2265192Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.2265802Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2266461Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2267060Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2267663Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2268280Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2268895Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2269492Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2269622Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.2269663Z Autotune Choices Stats: 2025-12-04T09:58:54.2270433Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.2270648Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2270825Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2271101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2271729Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2272377Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2273009Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2273628Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2274272Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2274904Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2275528Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2276203Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2276842Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2277475Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2277603Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.2277676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2277719Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.2277756Z unimplemented [] 2025-12-04T09:58:54.2277817Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2277914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2278513Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2278551Z graph_break [] 2025-12-04T09:58:54.2278624Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2278664Z Autotune Choices Stats: 2025-12-04T09:58:54.2279416Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.2279543Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2279656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2279818Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2280431Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2281043Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2281657Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2282260Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2282871Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2283485Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2284085Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2284696Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2285305Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2285914Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2286081Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.2286120Z Autotune Choices Stats: 2025-12-04T09:58:54.2286893Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.2287110Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2287273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2287563Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2288191Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2288818Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2289449Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2290090Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2290717Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2291350Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2291979Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2292603Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2293231Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2293863Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2294002Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.2294079Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2294122Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2294160Z unimplemented [] 
2025-12-04T09:58:54.2294220Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2294321Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2294895Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2294936Z graph_break [] 2025-12-04T09:58:54.2295010Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2295065Z Autotune Choices Stats: 2025-12-04T09:58:54.2295803Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.2295966Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2296096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2296256Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2296869Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2297480Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2298095Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2298715Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2299314Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2299930Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2300546Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2301151Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2301750Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2302363Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2302501Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.2302543Z Autotune Choices Stats: 2025-12-04T09:58:54.2303297Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2303512Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2303691Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2303965Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2304614Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2305240Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2305862Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2306538Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2307183Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2307809Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2308445Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2309091Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2309712Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2310329Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2310470Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.2310545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2310596Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2310634Z unimplemented [] 2025-12-04T09:58:54.2310695Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.2310796Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2311373Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2311410Z graph_break [] 2025-12-04T09:58:54.2311483Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2311524Z Autotune Choices Stats: 2025-12-04T09:58:54.2312275Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.2312403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2312519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2312681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2313305Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2313907Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2314510Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2315122Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2315734Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2316389Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2316992Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2317609Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2318207Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2318815Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2318956Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.2318998Z Autotune Choices Stats: 
2025-12-04T09:58:54.2319759Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.2319977Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2320141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2320418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2321066Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2321708Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2322330Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2322954Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2323592Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2324227Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2324858Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2325486Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2326161Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2326782Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2326910Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.2326988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2327046Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2327085Z unimplemented [] 2025-12-04T09:58:54.2327145Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2327245Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2327821Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2327875Z graph_break [] 2025-12-04T09:58:54.2327951Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2327992Z Autotune Choices Stats: 2025-12-04T09:58:54.2328727Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.2328856Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2328984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2329147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.2329753Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2330367Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2330972Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2331575Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2332188Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2332804Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2333421Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2334020Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2334632Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2335239Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2335367Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.2335421Z Autotune Choices Stats: 2025-12-04T09:58:54.2336199Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2336437Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2336604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2336884Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2337535Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2338159Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2338793Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2339413Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2340041Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2340678Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2341320Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2341957Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2342582Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2343217Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2343348Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.2343422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2343466Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2343502Z unimplemented [] 2025-12-04T09:58:54.2343563Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2343662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2344238Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2344286Z graph_break [] 2025-12-04T09:58:54.2344363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2344403Z Autotune Choices Stats: 2025-12-04T09:58:54.2345149Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.2345276Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2345388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2345552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2346204Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2346815Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2347428Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2348028Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2348641Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2349260Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2349861Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2350472Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2351073Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2351684Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2351815Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.2351855Z Autotune Choices Stats: 2025-12-04T09:58:54.2352604Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2352833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2352998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2353291Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2353918Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2354551Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2355177Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2355809Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2356492Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2357126Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2357767Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2358405Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2359046Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2359667Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2359794Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.2359883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2359927Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2359965Z unimplemented [] 2025-12-04T09:58:54.2360028Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2360128Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2360698Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2360735Z graph_break [] 2025-12-04T09:58:54.2360810Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2360865Z Autotune Choices Stats: 2025-12-04T09:58:54.2361610Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.2361748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2361862Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2362023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2362632Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2363249Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2363851Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2364471Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2365078Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2365692Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2366346Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2366945Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2367571Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2368169Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2368298Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.2368354Z Autotune Choices Stats: 2025-12-04T09:58:54.2369115Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.2369333Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2369499Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2369802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2370431Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2371083Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2371712Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2372333Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2372969Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2373593Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2374225Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2374869Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2375494Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2376171Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2376303Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.2376378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2376424Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2376461Z unimplemented [] 2025-12-04T09:58:54.2376523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2376620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2377212Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2377251Z graph_break [] 2025-12-04T09:58:54.2377323Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2377364Z Autotune Choices Stats: 2025-12-04T09:58:54.2378099Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.2378242Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2378358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2378530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2379134Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2379736Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2380353Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2380956Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2381565Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2382165Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2382787Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2383400Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2383999Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2384613Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2384742Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.2384781Z Autotune Choices Stats: 2025-12-04T09:58:54.2385558Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2385775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2385969Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2386245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2386899Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2387540Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2388165Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2388804Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2389431Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2390067Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2390694Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2391332Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.2391966Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2392592Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2392721Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.2392798Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2392852Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2392892Z unimplemented [] 2025-12-04T09:58:54.2392953Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2393052Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2393625Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2393664Z graph_break [] 2025-12-04T09:58:54.2393747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2393792Z Autotune Choices Stats: 2025-12-04T09:58:54.2394541Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.2394671Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2394787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2394960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2395570Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2396229Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2396833Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2397548Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.2398163Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2398765Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2399364Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2399978Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2400598Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2401195Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2401326Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.2401368Z Autotune Choices Stats: 2025-12-04T09:58:54.2402137Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2402356Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2402531Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2402812Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2403441Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2404072Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2404711Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2405332Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2406013Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2406645Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2407279Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2407908Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2408556Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2409195Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2409323Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.2409400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2409444Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2409482Z unimplemented [] 2025-12-04T09:58:54.2409541Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2409641Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2410223Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2410263Z graph_break [] 2025-12-04T09:58:54.2410336Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2410379Z Autotune Choices Stats: 2025-12-04T09:58:54.2411123Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.2411251Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2411368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2411526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2412138Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2412746Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2413356Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2413958Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2414576Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2415191Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2415791Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2416440Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2417053Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2417675Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2417804Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.2417847Z Autotune Choices Stats: 2025-12-04T09:58:54.2418624Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.2418843Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2419008Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2419286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2419936Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2420559Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2421189Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2421821Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2422448Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2423087Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2423711Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2424336Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2424962Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2425594Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2425729Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.2425806Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2425852Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2425892Z unimplemented [] 2025-12-04T09:58:54.2425989Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2426088Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2426655Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2426695Z graph_break [] 2025-12-04T09:58:54.2426770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2426810Z Autotune Choices Stats: 2025-12-04T09:58:54.2427557Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.2427683Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2427812Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2427973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2428583Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2429187Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2429806Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2430413Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2431009Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2431621Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2432239Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2432841Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2433442Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2434055Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2434193Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.2434236Z Autotune Choices Stats: 2025-12-04T09:58:54.2434993Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.2435211Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2435387Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2435663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2436359Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2436985Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2437606Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2438246Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2438881Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2439512Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2440153Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2440784Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2441407Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2442035Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2442173Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.2442250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2442293Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2442342Z unimplemented [] 2025-12-04T09:58:54.2442402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2442503Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2443075Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2443113Z graph_break [] 2025-12-04T09:58:54.2443188Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2443227Z Autotune Choices Stats: 2025-12-04T09:58:54.2443991Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.2444118Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2444233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2444399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2445018Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2445624Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2446260Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2446872Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2447484Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2448099Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2448698Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2449312Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2449919Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2450521Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2450660Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.2450702Z Autotune Choices Stats: 2025-12-04T09:58:54.2451460Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.2451687Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2451855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2452140Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2452781Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2453416Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2454041Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2454662Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2455297Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2455985Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2456689Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2457335Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2457976Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2458601Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2458730Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.2458805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2458863Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2458899Z unimplemented [] 2025-12-04T09:58:54.2458959Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2459057Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2459629Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2459678Z graph_break [] 
2025-12-04T09:58:54.2459754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2459794Z Autotune Choices Stats: 2025-12-04T09:58:54.2460534Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.2460666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2460780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2460960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2461572Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2462191Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2462793Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2463397Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2464010Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2464622Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2465247Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2465855Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2466546Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2467147Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2467277Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.2467317Z Autotune Choices Stats: 2025-12-04T09:58:54.2468096Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.2468323Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2468493Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2468768Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2469404Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2470050Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.2470682Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2471309Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2471946Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2472576Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2473210Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2473852Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2474476Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2475111Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2475242Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.2475315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2475359Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2475394Z unimplemented [] 2025-12-04T09:58:54.2475455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2475554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2476171Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2476231Z graph_break [] 2025-12-04T09:58:54.2476305Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.2476345Z Autotune Choices Stats: 2025-12-04T09:58:54.2477097Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.2477224Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2477336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2477499Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2478118Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2478716Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2479349Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2479956Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2480575Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2481196Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2481801Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2482417Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2483021Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2483637Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2483768Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.2483807Z Autotune Choices Stats: 2025-12-04T09:58:54.2484578Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.2484808Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2484971Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2485259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2485897Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2486576Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2487191Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2487831Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2488463Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2489089Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2489725Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2490371Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2491015Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2491638Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2491769Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.2491870Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.2491921Z Traceback (most recent call last): 2025-12-04T09:58:54.2492077Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.2492119Z self.assertTrue( 2025-12-04T09:58:54.2492224Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.2492276Z raise self.failureException(msg) 2025-12-04T09:58:54.2492402Z AssertionError: False is not true : Log file /tmp/tmpxarurl0o/flex_attention_configs.json was not created 2025-12-04T09:58:54.2492408Z 2025-12-04T09:58:54.2492484Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.2492649Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.2492652Z 2025-12-04T09:58:54.2492740Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.2492826Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T09:58:54.2492868Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2492905Z unimplemented [] 2025-12-04T09:58:54.2492965Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2493539Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.2493647Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2493686Z graph_break [] 2025-12-04T09:58:54.2493762Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2494261Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.2494310Z current_size = base.storage().size() 2025-12-04T09:58:54.2494354Z Autotune Choices Stats: 2025-12-04T09:58:54.2495112Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.2495241Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2495359Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2495521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2496191Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2496794Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2497397Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2498014Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2498633Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2499247Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2499852Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2500460Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2501058Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2501661Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2501804Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.2501845Z Autotune Choices Stats: 2025-12-04T09:58:54.2502612Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.2502831Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2502997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2503273Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2503919Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2504544Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2505164Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2505794Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2506461Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2507105Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2507752Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2508371Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2509008Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2509633Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2509762Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.2509838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2509893Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2509930Z unimplemented [] 2025-12-04T09:58:54.2509990Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2510091Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2510659Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2510706Z graph_break [] 2025-12-04T09:58:54.2510783Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.2510826Z Autotune Choices Stats: 2025-12-04T09:58:54.2511566Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.2511693Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2511824Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2511986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2512592Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2513214Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2513825Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2514425Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2515039Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2515656Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2516318Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2516917Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2517531Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2518134Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2518262Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.2518318Z Autotune Choices Stats: 2025-12-04T09:58:54.2519075Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2519302Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2519468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2519746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2520397Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2521024Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2521657Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2522279Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2522914Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2523546Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2524183Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2524819Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2525441Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2526133Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2526268Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.2526342Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2526387Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2526425Z unimplemented [] 2025-12-04T09:58:54.2526488Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2526588Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2527164Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2527214Z graph_break [] 2025-12-04T09:58:54.2527290Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2527330Z Autotune Choices Stats: 2025-12-04T09:58:54.2528087Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.2528217Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2528331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2528494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2529129Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2529722Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2530334Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2530934Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2531537Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.2532159Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2532771Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2533382Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2533980Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2534596Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2534727Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.2534766Z Autotune Choices Stats: 2025-12-04T09:58:54.2535522Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.2535749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.2535915Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2536248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2536885Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2537537Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2538171Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2538808Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2539432Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2540060Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2540695Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2541335Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2541971Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2542602Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2542730Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.2542812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2542859Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2542897Z unimplemented [] 2025-12-04T09:58:54.2542958Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2543059Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2543632Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2543668Z graph_break [] 2025-12-04T09:58:54.2543747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2543804Z Autotune Choices Stats: 2025-12-04T09:58:54.2544561Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.2544711Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2544826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2544992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2545599Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2546263Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2546873Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2547493Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2548100Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2548714Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2549328Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2549939Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2550550Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2551151Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2551280Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.2551329Z Autotune Choices Stats: 2025-12-04T09:58:54.2552086Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.2552301Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2552465Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2552758Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2553390Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2554024Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2554662Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2555286Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2555967Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2556592Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2557216Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2557862Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2558490Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2559127Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2559258Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.2559333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2559375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2559414Z unimplemented [] 2025-12-04T09:58:54.2559478Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2559576Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2560166Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2560207Z graph_break [] 2025-12-04T09:58:54.2560281Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2560321Z Autotune Choices Stats: 2025-12-04T09:58:54.2561062Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.2561210Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2561322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2561494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2562107Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2562712Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.2563329Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2563936Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2564552Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2565156Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2565769Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2566426Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2567021Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2567642Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2567773Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.2567813Z Autotune Choices Stats: 2025-12-04T09:58:54.2568592Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.2568809Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2568973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2569252Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2569884Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2570536Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2571155Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2571787Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2572413Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2573048Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2573674Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2574309Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2574945Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2575569Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2575699Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.2575774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2575835Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2575873Z unimplemented [] 2025-12-04T09:58:54.2575981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2576079Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2576652Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2576691Z graph_break [] 2025-12-04T09:58:54.2576791Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2576834Z Autotune Choices Stats: 2025-12-04T09:58:54.2577572Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.2577703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2577820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2577998Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2578608Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2579222Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2579824Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2580448Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2581060Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2581662Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2582269Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2582882Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2583492Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2584092Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2584226Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.2584269Z Autotune Choices Stats: 2025-12-04T09:58:54.2585038Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.2585258Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2585434Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2585713Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2586405Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2587047Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2587681Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2588309Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2588959Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2589584Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2590222Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2590852Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2591495Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2592128Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2592258Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.2592333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2592376Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2592422Z unimplemented [] 2025-12-04T09:58:54.2592481Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2592583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2593175Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2593215Z graph_break [] 2025-12-04T09:58:54.2593288Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2593328Z Autotune Choices Stats: 2025-12-04T09:58:54.2594070Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.2594201Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2594319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2594480Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2595091Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2595703Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2596366Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2596970Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2597600Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2598216Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2598816Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2599426Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2600039Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2600658Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2600788Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.2600828Z Autotune Choices Stats: 2025-12-04T09:58:54.2601593Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.2601810Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2601976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2602253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2602896Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2603523Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2604153Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2604790Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2605419Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2606107Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2606743Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2607369Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2607991Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2608627Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2608773Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.2608848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2608893Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2608933Z unimplemented [] 2025-12-04T09:58:54.2608994Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2609096Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2609668Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2609711Z graph_break [] 2025-12-04T09:58:54.2609786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2609830Z Autotune Choices Stats: 2025-12-04T09:58:54.2610581Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.2610709Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2610836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2610998Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2611618Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2612227Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2612841Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2613455Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2614058Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2614671Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2615281Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2615885Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2616527Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2617141Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2617282Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.2617323Z Autotune Choices Stats: 2025-12-04T09:58:54.2618080Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.2618297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2618479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2618755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.2619398Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2620025Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2620654Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2621283Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2621920Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2622549Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2623183Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2623812Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2624439Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2625069Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2625212Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices
2025-12-04T09:58:54.2625287Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.2625328Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.2625376Z unimplemented []
2025-12-04T09:58:54.2625436Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.2625534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:54.2626144Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)]
2025-12-04T09:58:54.2626184Z graph_break []
2025-12-04T09:58:54.2626258Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:54.2626297Z Autotune Choices Stats:
2025-12-04T09:58:54.2627067Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0}
2025-12-04T09:58:54.2627193Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2627307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2627471Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2628089Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2628694Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2629297Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2629907Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2630528Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2631130Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2631735Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2632342Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2632947Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2633546Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2633682Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.2633724Z Autotune Choices Stats: 2025-12-04T09:58:54.2634481Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.2634709Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2634873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2635149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2635790Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2636467Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2637095Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2637720Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2638359Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2639004Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2639634Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2640273Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2640925Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2641550Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2641685Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.2641760Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2641815Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2641852Z unimplemented [] 2025-12-04T09:58:54.2641917Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2642014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2642593Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2642643Z graph_break [] 2025-12-04T09:58:54.2642719Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2642762Z Autotune Choices Stats: 2025-12-04T09:58:54.2643503Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.2643631Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.2643744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2643921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2644533Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2645154Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2645757Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2646396Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2647023Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2647655Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2648281Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2648883Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2649512Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2650113Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2650246Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.2650286Z Autotune Choices Stats: 2025-12-04T09:58:54.2651049Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.2651274Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2651448Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2651729Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2652358Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2652998Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2653627Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2654249Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2654878Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2655517Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2656183Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2656834Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2657458Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2658098Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2658229Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.2658302Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.2658345Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2658384Z unimplemented [] 2025-12-04T09:58:54.2658447Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2658546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2659119Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2659173Z graph_break [] 2025-12-04T09:58:54.2659248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2659288Z Autotune Choices Stats: 2025-12-04T09:58:54.2660021Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.2660169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2660282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2660445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2661075Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2661679Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2662292Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.2662895Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2663493Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2664105Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2664717Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2665334Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2665965Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2666586Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2666714Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.2666753Z Autotune Choices Stats: 2025-12-04T09:58:54.2667515Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.2667746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2667910Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2668198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2668827Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2669465Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2670086Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2673576Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2674220Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2674849Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2675480Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2676168Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2676813Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2677437Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2677570Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.2677667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2677717Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.2677758Z unimplemented [] 2025-12-04T09:58:54.2677822Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2677929Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2678507Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2678549Z graph_break [] 2025-12-04T09:58:54.2678625Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2678670Z Autotune Choices Stats: 2025-12-04T09:58:54.2679430Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.2679575Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2679693Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2679863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2680477Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2681095Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2681701Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2682312Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2682914Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2683527Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2684146Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2684740Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2685348Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2685995Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2686124Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.2686164Z Autotune Choices Stats: 2025-12-04T09:58:54.2686934Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.2687154Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2687320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2687616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2688247Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2688879Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2689514Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2690140Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2690777Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2691401Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2692028Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2692671Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2693309Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2693947Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2694078Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.2694154Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2694197Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2694236Z unimplemented [] 
2025-12-04T09:58:54.2694298Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2694398Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2694978Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2695018Z graph_break [] 2025-12-04T09:58:54.2695093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2695133Z Autotune Choices Stats: 2025-12-04T09:58:54.2695869Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.2696044Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2696158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2696319Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2696947Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2697552Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2698170Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2698775Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2699386Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2699993Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2700608Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2701221Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2701821Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2702434Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2702566Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.2702608Z Autotune Choices Stats: 2025-12-04T09:58:54.2703369Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2703588Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2703757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2704037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2704668Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2705301Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2705978Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2706615Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2707243Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2707885Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2708513Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2709153Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2709793Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2710418Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2710548Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.2710625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2710669Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2710718Z unimplemented [] 2025-12-04T09:58:54.2710780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.2710880Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2711455Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2711493Z graph_break [] 2025-12-04T09:58:54.2711566Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2711628Z Autotune Choices Stats: 2025-12-04T09:58:54.2712366Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.2712495Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2712609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2712781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2713391Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2714008Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2714610Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2715223Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2715826Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2716551Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2717152Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2717765Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2718384Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2718983Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2719110Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.2719150Z Autotune Choices Stats: 
2025-12-04T09:58:54.2719920Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.2720138Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2720314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2720589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2721221Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2721847Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2722481Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2723101Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2723747Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2724375Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2725004Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2725632Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2726305Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2726949Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2727077Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.2727157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2727198Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2727236Z unimplemented [] 2025-12-04T09:58:54.2727295Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2727397Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2727990Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2728031Z graph_break [] 2025-12-04T09:58:54.2728105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2728144Z Autotune Choices Stats: 2025-12-04T09:58:54.2728897Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.2729025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2729140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2729302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.2729915Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2730544Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2731156Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2731756Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2732375Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2732983Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2733587Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2734189Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2734800Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2735411Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2735540Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.2735581Z Autotune Choices Stats: 2025-12-04T09:58:54.2736393Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2736613Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2736777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2737056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2737698Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.2738326Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2738962Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2739601Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2740229Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2740872Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2741495Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2742136Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2742766Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2743396Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2743533Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.2743608Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2743650Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2743692Z unimplemented [] 2025-12-04T09:58:54.2743752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2743853Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2744438Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2744476Z graph_break [] 2025-12-04T09:58:54.2744551Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2744593Z Autotune Choices Stats: 2025-12-04T09:58:54.2745344Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.2745471Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2745595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2745756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2746418Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2747020Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2747632Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2748242Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2748847Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2749461Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2750077Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2750684Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2751286Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2751903Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2752041Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.2752081Z Autotune Choices Stats: 2025-12-04T09:58:54.2752847Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2753063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2753237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2753514Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2754146Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2754787Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2755410Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2756085Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2756730Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2757354Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2757986Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2758625Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2759254Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2759876Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2760018Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.2760091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2760134Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2760171Z unimplemented [] 2025-12-04T09:58:54.2760245Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2760345Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2760925Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2760962Z graph_break [] 2025-12-04T09:58:54.2761036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2761075Z Autotune Choices Stats: 2025-12-04T09:58:54.2761823Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.2761952Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2762064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2762224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2762851Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2763453Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2764049Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2764661Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2765271Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2765875Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2766553Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2767172Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2767772Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2768372Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2768519Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.2768558Z Autotune Choices Stats: 2025-12-04T09:58:54.2769314Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.2769553Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2769716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2769993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2770631Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2771276Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2771899Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2772524Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2773165Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2773803Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2774429Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2775065Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2775698Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2776354Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2776483Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.2776556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2776599Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2776658Z unimplemented [] 2025-12-04T09:58:54.2776721Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2776820Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2777399Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2777455Z graph_break [] 2025-12-04T09:58:54.2777528Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2777567Z Autotune Choices Stats: 2025-12-04T09:58:54.2778307Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.2778436Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2778548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2778731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2779337Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2779953Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2780553Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2781161Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2781767Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2782378Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2782997Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2783598Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2784208Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2784809Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2784938Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.2784978Z Autotune Choices Stats: 2025-12-04T09:58:54.2785730Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2786000Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2786164Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2786439Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2787073Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2787716Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2788351Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2788976Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2789601Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2790236Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2790870Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2791501Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.2792139Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2792772Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2792901Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.2792974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2793018Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2793054Z unimplemented [] 2025-12-04T09:58:54.2793116Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2793214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2793784Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2793832Z graph_break [] 2025-12-04T09:58:54.2793904Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2793945Z Autotune Choices Stats: 2025-12-04T09:58:54.2794687Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.2794825Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2794938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2795098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2795722Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2796357Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2796980Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2797580Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.2798183Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2798799Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2799411Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2800026Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2800620Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2801242Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2801372Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.2801412Z Autotune Choices Stats: 2025-12-04T09:58:54.2802167Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.2802393Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2802557Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2802852Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2803486Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2804112Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2804759Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2805390Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2806041Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2806669Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2807314Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2807954Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2808593Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2809213Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2809341Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.2809415Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2809468Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2809506Z unimplemented [] 2025-12-04T09:58:54.2809568Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2809668Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2810239Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2810279Z graph_break [] 2025-12-04T09:58:54.2810353Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2810396Z Autotune Choices Stats: 2025-12-04T09:58:54.2811144Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.2811282Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2811395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2811559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2812171Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2812785Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2813388Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2813997Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2814602Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2815205Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2815816Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2816437Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2817056Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2817658Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2817788Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.2817829Z Autotune Choices Stats: 2025-12-04T09:58:54.2818602Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.2818821Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2818986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2819278Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2819909Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2820548Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2821184Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2821800Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2822439Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2823066Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2823692Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2824327Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2824973Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2825604Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2825733Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.2825809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2825850Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2825887Z unimplemented [] 2025-12-04T09:58:54.2825963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2826063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2826656Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2826694Z graph_break [] 2025-12-04T09:58:54.2826769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2826810Z Autotune Choices Stats: 2025-12-04T09:58:54.2827548Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.2827688Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2827800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2827961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2828589Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2829191Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2829805Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2830405Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2831019Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2831622Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2832231Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2832849Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2833450Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2834062Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2834192Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.2834233Z Autotune Choices Stats: 2025-12-04T09:58:54.2835006Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.2835225Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2835390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2835667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2836336Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2837074Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2837706Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2838339Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2838970Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2839613Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2840235Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2840859Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2841509Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2842135Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2842265Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.2842340Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2842381Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2842420Z unimplemented [] 2025-12-04T09:58:54.2842491Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2842593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2843174Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2843217Z graph_break [] 2025-12-04T09:58:54.2843291Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2843330Z Autotune Choices Stats: 2025-12-04T09:58:54.2844072Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.2844199Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2844313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2844474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2845098Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2845708Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2846346Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2846962Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2847563Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2848177Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2848781Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2849398Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2850011Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2850614Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2850743Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.2850783Z Autotune Choices Stats: 2025-12-04T09:58:54.2851557Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.2851774Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2851952Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2852230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2852859Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2853493Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2854136Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2854754Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2855394Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2856057Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2856696Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2857327Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2857968Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2858602Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2858730Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.2858805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2858847Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2858886Z unimplemented [] 2025-12-04T09:58:54.2858948Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2859047Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2859626Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2859665Z graph_break [] 
2025-12-04T09:58:54.2859739Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2859778Z Autotune Choices Stats: 2025-12-04T09:58:54.2860523Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.2860651Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2860765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2860926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2861540Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2862157Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2862772Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2863377Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2863989Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2864603Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2865217Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2865826Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2866475Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2867103Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2867230Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.2867271Z Autotune Choices Stats: 2025-12-04T09:58:54.2868039Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.2868256Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2868420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2868699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2869347Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2869972Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.2870607Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2871241Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2871870Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2872507Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2873127Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2873772Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2874398Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2875035Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2875173Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.2875248Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2875293Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2875331Z unimplemented [] 2025-12-04T09:58:54.2875392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2875489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2876101Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2876138Z graph_break [] 2025-12-04T09:58:54.2876214Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.2876255Z Autotune Choices Stats: 2025-12-04T09:58:54.2877011Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.2877144Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2877256Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2877433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2878036Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2878641Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2879267Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2879883Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2880487Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2881103Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2881716Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2882319Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2882922Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2883543Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2883682Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.2883724Z Autotune Choices Stats: 2025-12-04T09:58:54.2884486Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.2884700Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2884876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2885157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2885791Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2886467Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2887100Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2887736Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2888377Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2889004Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2889646Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2890285Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2890911Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2891542Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2891681Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.2891754Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2891798Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2891834Z unimplemented [] 2025-12-04T09:58:54.2891904Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2892003Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2892577Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2892613Z graph_break [] 2025-12-04T09:58:54.2892687Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2892727Z Autotune Choices Stats: 
2025-12-04T09:58:54.2893480Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.2893608Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2893719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2893882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2894504Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2895110Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2895715Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2896362Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2896981Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2897588Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2898202Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2898828Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2899436Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2900041Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2900185Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.2900224Z Autotune Choices Stats: 2025-12-04T09:58:54.2900985Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.2901210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2901380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2901659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2902299Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2902933Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2903565Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2904190Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2904826Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2905467Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2906124Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2906762Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2907398Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2908023Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2908153Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.2908246Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.2908308Z Traceback (most recent call last): 2025-12-04T09:58:54.2908463Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.2908506Z self.assertTrue( 2025-12-04T09:58:54.2908612Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.2908663Z raise self.failureException(msg) 2025-12-04T09:58:54.2908791Z AssertionError: False is not true : Log file /tmp/tmpxgf0yi50/flex_attention_configs.json was not created 2025-12-04T09:58:54.2908809Z 2025-12-04T09:58:54.2908885Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.2909051Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.2909053Z 2025-12-04T09:58:54.2909144Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.2909225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2909268Z frames [('total', 2), ('ok', 2)] 
2025-12-04T09:58:54.2909310Z unimplemented [] 2025-12-04T09:58:54.2909373Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2909947Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.2910047Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2910085Z graph_break [] 2025-12-04T09:58:54.2910161Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2910662Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.2910710Z current_size = base.storage().size() 2025-12-04T09:58:54.2910754Z Autotune Choices Stats: 2025-12-04T09:58:54.2911516Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.2911646Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2911760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2911920Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2912532Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2913151Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2913761Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2914360Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2914978Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2915588Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2916222Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2916821Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2917437Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2918053Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2918183Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.2918226Z Autotune Choices Stats: 2025-12-04T09:58:54.2918989Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.2919211Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2919383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2919659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2920311Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2920934Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2921568Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2922197Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2922822Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2923458Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2924088Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2924714Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2925342Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2926013Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2926153Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.2926231Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2926275Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2926316Z unimplemented [] 2025-12-04T09:58:54.2926379Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2926482Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2927059Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2927098Z graph_break [] 2025-12-04T09:58:54.2927174Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.2927218Z Autotune Choices Stats: 2025-12-04T09:58:54.2927965Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.2928093Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2928222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2928388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2928996Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2929610Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2930223Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2930844Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2931448Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2932061Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2932672Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2933274Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2933879Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2934488Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2934623Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.2934665Z Autotune Choices Stats: 2025-12-04T09:58:54.2935423Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.2935640Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2935816Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2936122Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2936773Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2937400Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2938024Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2938657Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2939294Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2939919Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2940553Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2941189Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2941810Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2942436Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2942574Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.2942650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2942694Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2942741Z unimplemented [] 2025-12-04T09:58:54.2942806Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2942906Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2943553Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.2943591Z graph_break [] 2025-12-04T09:58:54.2943666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2943707Z Autotune Choices Stats: 2025-12-04T09:58:54.2944455Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.2944585Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2944699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2944859Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2945480Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2946111Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2946715Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2947334Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2947949Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.2948550Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2949176Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2949788Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2950387Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2950997Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2951136Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.2951177Z Autotune Choices Stats: 2025-12-04T09:58:54.2951939Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.2952164Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.2952329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2952607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2953262Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2953903Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2954528Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2955156Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2955789Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2956457Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2957083Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2957723Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2958357Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2958980Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2959108Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.2959183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2959228Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2959278Z unimplemented [] 2025-12-04T09:58:54.2959339Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2959439Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2960017Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2960069Z graph_break [] 2025-12-04T09:58:54.2960144Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2960183Z Autotune Choices Stats: 2025-12-04T09:58:54.2960931Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.2961060Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2961172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2961347Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2961957Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2962577Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2963181Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2963785Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2964399Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2965013Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2965621Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2966267Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2966872Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2967476Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2967606Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.2967646Z Autotune Choices Stats: 2025-12-04T09:58:54.2968402Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.2968644Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2968810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2969090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2969722Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2970355Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2970991Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2971606Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2972233Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2972865Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.2973500Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2974132Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2974765Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2975407Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2975538Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.2975611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2975654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2975691Z unimplemented [] 2025-12-04T09:58:54.2975754Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2975852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2976470Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2976525Z graph_break [] 2025-12-04T09:58:54.2976598Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2976640Z Autotune Choices Stats: 2025-12-04T09:58:54.2977383Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.2977527Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2977640Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2977801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2978433Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2979041Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.2979654Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2980256Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2980859Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2981475Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2982091Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2982708Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2983311Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2983926Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2984057Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.2984097Z Autotune Choices Stats: 2025-12-04T09:58:54.2984855Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.2985081Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2985246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2985533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2986197Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2986822Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2987464Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2988099Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2988727Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2989359Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2989990Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2990623Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2991260Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2991885Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2992015Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.2992089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.2992141Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.2992180Z unimplemented [] 2025-12-04T09:58:54.2992244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.2992343Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.2992926Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.2992964Z graph_break [] 2025-12-04T09:58:54.2993037Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.2993080Z Autotune Choices Stats: 2025-12-04T09:58:54.2993832Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.2993980Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.2994095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.2994258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.2994870Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2995491Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2996123Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2996744Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2997350Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2997957Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2998570Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.2999187Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.2999801Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3000393Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3000520Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.3000563Z Autotune Choices Stats: 2025-12-04T09:58:54.3001330Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.3001547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3001712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3002000Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3002631Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3003266Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3003900Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3004525Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3005165Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3005784Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3006443Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3007094Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3007729Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3008360Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3008492Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.3008568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3008611Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3008650Z unimplemented [] 2025-12-04T09:58:54.3008711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3008811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3009400Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3009441Z graph_break [] 2025-12-04T09:58:54.3009514Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3009557Z Autotune Choices Stats: 2025-12-04T09:58:54.3010293Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.3010439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3010553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3010712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3011334Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3011940Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3012556Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3013165Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3013775Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3014379Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3014998Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3015615Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3016249Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3016868Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3016997Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.3017037Z Autotune Choices Stats: 2025-12-04T09:58:54.3017808Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.3018025Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3018191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3018469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3019109Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3019749Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3020382Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3021018Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3021652Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3022290Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3022909Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3023548Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3024207Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3024832Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3024960Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.3025037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3025079Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3025121Z unimplemented [] 2025-12-04T09:58:54.3025191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3025293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3025869Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3025908Z graph_break [] 2025-12-04T09:58:54.3026015Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3026058Z Autotune Choices Stats: 2025-12-04T09:58:54.3026811Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.3026939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3027054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3027216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3027834Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3028451Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3029053Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3029667Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3030270Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3030884Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3031490Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3032104Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3032716Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3033321Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3033451Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.3033493Z Autotune Choices Stats: 2025-12-04T09:58:54.3034263Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.3034481Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3034657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3034937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.3035573Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3036234Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3036882Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3037510Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3038151Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3038775Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3039410Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3040037Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3040673Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3041306Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3041435Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.3041511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3041554Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3041594Z unimplemented [] 2025-12-04T09:58:54.3041657Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3041759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3042355Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3042394Z graph_break [] 2025-12-04T09:58:54.3042471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3042511Z Autotune Choices Stats: 2025-12-04T09:58:54.3043260Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.3043390Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3043507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3043670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3044281Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3044890Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3045510Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3046144Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3046763Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3047367Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3047987Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3048590Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3049206Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3049826Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3049955Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.3049998Z Autotune Choices Stats: 2025-12-04T09:58:54.3050765Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3050984Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3051153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3051433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3052075Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3052702Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3053335Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3053968Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3054595Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3055233Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3055865Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3056535Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3057161Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3057797Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3057945Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.3058020Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3058066Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3058104Z unimplemented [] 2025-12-04T09:58:54.3058170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3058268Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3058854Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3058892Z graph_break [] 2025-12-04T09:58:54.3058968Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3059007Z Autotune Choices Stats: 2025-12-04T09:58:54.3059768Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.3059900Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.3060014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3060187Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3060795Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3061399Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3062010Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3062630Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3063233Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3063846Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3064453Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3065055Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3065664Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3066310Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3066454Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.3066495Z Autotune Choices Stats: 2025-12-04T09:58:54.3067254Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.3067473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3067654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3067937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3068566Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3069196Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3069825Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3070462Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3071097Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3071729Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3072361Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3072997Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3073624Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3074255Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3074401Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.3074474Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.3074519Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3074556Z unimplemented [] 2025-12-04T09:58:54.3074620Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3074730Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3075308Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3075346Z graph_break [] 2025-12-04T09:58:54.3075425Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3075465Z Autotune Choices Stats: 2025-12-04T09:58:54.3076256Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.3076386Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3076498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3076661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3077289Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3077896Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3078499Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3079111Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3079721Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3080321Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3080937Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3081554Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3082158Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3082760Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3082899Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.3082939Z Autotune Choices Stats: 2025-12-04T09:58:54.3083700Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.3083928Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3084093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3084372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3085016Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3085656Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3086313Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3086937Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3087570Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3088209Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3088832Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3089474Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3090120Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3090744Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3090877Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.3090951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3090994Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.3091043Z unimplemented [] 2025-12-04T09:58:54.3091107Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3091206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3091791Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3091843Z graph_break [] 2025-12-04T09:58:54.3091915Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3091958Z Autotune Choices Stats: 2025-12-04T09:58:54.3092698Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.3092831Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3092945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3093119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3093739Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3094346Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3094949Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3095559Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3096205Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3096823Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3097422Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3098051Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3098663Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3099262Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3099394Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.3099435Z Autotune Choices Stats: 2025-12-04T09:58:54.3100204Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.3100439Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3100605Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3100881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3101509Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3102148Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3102780Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3103403Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3104037Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3104669Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3105296Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3105959Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3106605Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3107239Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3107372Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.3107450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3107494Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3107533Z unimplemented [] 
2025-12-04T09:58:54.3107595Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3107696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3108275Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3108328Z graph_break [] 2025-12-04T09:58:54.3108401Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3108442Z Autotune Choices Stats: 2025-12-04T09:58:54.3109171Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.3109311Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3109426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3109586Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3110210Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3110815Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3111428Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3112035Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3112636Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3113246Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3113861Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3114478Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3115078Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3115693Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3115824Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.3115868Z Autotune Choices Stats: 2025-12-04T09:58:54.3116661Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.3116897Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3117063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3117356Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3117993Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3118620Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3119256Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3119892Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3120523Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3121150Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3121780Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3122425Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3123060Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3123682Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3123810Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.3123885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3123927Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3123982Z unimplemented [] 2025-12-04T09:58:54.3124048Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.3124149Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3124725Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3124764Z graph_break [] 2025-12-04T09:58:54.3124837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3124879Z Autotune Choices Stats: 2025-12-04T09:58:54.3125619Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.3125765Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3125884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3126092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3126706Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3127330Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3127933Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3128549Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3129155Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3129759Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3130373Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3130994Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3131603Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3132206Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3132334Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.3132377Z Autotune Choices Stats: 
2025-12-04T09:58:54.3133140Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.3133358Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3133526Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3133813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3134454Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3135095Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3135726Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3136391Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3137038Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3137666Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3138288Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3138935Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3139573Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3140211Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3140341Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.3140417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3140459Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3140498Z unimplemented [] 2025-12-04T09:58:54.3140558Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3140661Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3141246Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3141287Z graph_break [] 2025-12-04T09:58:54.3141363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3141403Z Autotune Choices Stats: 2025-12-04T09:58:54.3142139Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.3142279Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3142395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3142556Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.3143175Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3143783Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3144405Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3145006Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3145624Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3146268Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3146868Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3147483Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3148096Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3148715Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3148844Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.3148886Z Autotune Choices Stats: 2025-12-04T09:58:54.3149655Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.3149874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3150044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3150325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3150963Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3151599Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3152235Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3152870Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3153499Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3154140Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3154764Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3155389Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3156060Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3156703Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3156830Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.3156905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3156947Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3156985Z unimplemented [] 2025-12-04T09:58:54.3157061Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3157161Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3157730Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3157767Z graph_break [] 2025-12-04T09:58:54.3157841Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3157881Z Autotune Choices Stats: 2025-12-04T09:58:54.3158628Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.3158757Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3158870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3159031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3159652Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3160269Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3160873Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3161481Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3162078Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3162694Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3163298Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3163906Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3164522Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3165125Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3165254Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.3165294Z Autotune Choices Stats: 2025-12-04T09:58:54.3166103Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.3166318Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3166504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3166778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3167405Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3168031Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3168671Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3169301Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3169944Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3170572Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3171205Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3171830Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3172466Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3173102Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3173231Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.3173305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3173347Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3173383Z unimplemented [] 2025-12-04T09:58:54.3173445Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3173544Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3174130Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3174169Z graph_break [] 2025-12-04T09:58:54.3174244Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3174284Z Autotune Choices Stats: 2025-12-04T09:58:54.3175033Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.3175161Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3175274Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3175435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3176081Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3176698Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3178789Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3179392Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3180020Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3180628Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3181246Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3181850Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3182466Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3183076Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3183206Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.3183247Z Autotune Choices Stats: 2025-12-04T09:58:54.3184000Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.3184229Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3184394Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3184675Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3185322Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3185999Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3186638Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3187279Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3187905Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3188549Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3189175Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3189807Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3190436Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3191070Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3191209Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.3191283Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3191325Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3191363Z unimplemented [] 2025-12-04T09:58:54.3191426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3191527Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3192106Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3192143Z graph_break [] 2025-12-04T09:58:54.3192218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3192257Z Autotune Choices Stats: 2025-12-04T09:58:54.3193001Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.3193131Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3193245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3193417Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3194030Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3194635Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3195245Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3195856Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3196496Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3197115Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3197729Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3198331Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3198930Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3199544Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3199685Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.3199724Z Autotune Choices Stats: 2025-12-04T09:58:54.3200486Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.3200705Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3200870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3201159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3201785Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3202424Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3203047Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3203685Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3204324Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3204951Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3205585Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3206254Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3206887Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3207515Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3207653Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.3207727Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3207771Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3207808Z unimplemented [] 2025-12-04T09:58:54.3207871Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3207986Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3208567Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3208606Z graph_break [] 2025-12-04T09:58:54.3208680Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3208721Z Autotune Choices Stats: 2025-12-04T09:58:54.3209478Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.3209609Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3209723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3209884Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3210508Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3211107Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3211711Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3212331Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.3212943Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3213542Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3214153Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3214766Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3215367Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3216007Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3216147Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.3216188Z Autotune Choices Stats: 2025-12-04T09:58:54.3216951Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.3217186Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3217351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3217633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3218280Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3218895Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3219540Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3220161Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3220802Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3221436Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3222051Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3222693Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3223328Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3223952Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3224081Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.3224155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3224197Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3224233Z unimplemented [] 2025-12-04T09:58:54.3224305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3224405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3224980Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3225027Z graph_break [] 2025-12-04T09:58:54.3225101Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3225141Z Autotune Choices Stats: 2025-12-04T09:58:54.3225873Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.3226038Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3226156Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3226316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3226948Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3227562Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3228166Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3228769Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3229381Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3229997Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3230606Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3231220Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3231830Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3232432Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3232564Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.3232607Z Autotune Choices Stats: 2025-12-04T09:58:54.3233364Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.3233600Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3233765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3234048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3234687Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3235327Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3235992Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3236620Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3237248Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3237886Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3238520Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3239150Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3239787Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3240418Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3240546Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.3240621Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3240663Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3240703Z unimplemented [] 2025-12-04T09:58:54.3240764Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3240867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3241438Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3241487Z graph_break [] 2025-12-04T09:58:54.3241561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3241603Z Autotune Choices Stats: 2025-12-04T09:58:54.3242344Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.3242481Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3242596Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3242756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3243380Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3243984Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3244602Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3245203Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3245807Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3246444Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3247060Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3247661Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3248273Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3248893Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3249023Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.3249064Z Autotune Choices Stats: 2025-12-04T09:58:54.3249818Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.3250049Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3250213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3250505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3251139Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3251763Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3252394Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3253031Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3253657Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3254284Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3254921Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3255552Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3256210Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3256853Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3256981Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.3257055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3257097Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3257136Z unimplemented [] 2025-12-04T09:58:54.3257214Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3257317Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3257889Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3257928Z graph_break [] 2025-12-04T09:58:54.3258002Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3258042Z Autotune Choices Stats: 2025-12-04T09:58:54.3260563Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.3260726Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3260842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3261001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3261631Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3262233Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3262841Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3263445Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3264051Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3264659Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3265332Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3265983Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3266586Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3267200Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3267331Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.3267372Z Autotune Choices Stats: 2025-12-04T09:58:54.3268134Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.3268356Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3268524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3268824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3269498Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3270132Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3270764Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3271390Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3272019Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3272650Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3273278Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3273940Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3274569Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3275198Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3275329Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.3275405Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3275450Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3275488Z unimplemented [] 2025-12-04T09:58:54.3275552Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3275654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3276264Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3276305Z graph_break [] 
2025-12-04T09:58:54.3276379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3276420Z Autotune Choices Stats: 2025-12-04T09:58:54.3277169Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.3277313Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3277430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3277588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3278235Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3278838Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3279438Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3280045Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3280647Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3281261Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3281866Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3282498Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3283123Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3283726Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3283856Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.3283897Z Autotune Choices Stats: 2025-12-04T09:58:54.3284656Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.3284877Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3285040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3285317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3285982Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3286651Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.3287289Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3287913Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3288539Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3289178Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3289800Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3290427Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3291084Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3291845Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3291973Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.3292048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3292090Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3292128Z unimplemented [] 2025-12-04T09:58:54.3292189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3292290Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3292862Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3292901Z graph_break [] 2025-12-04T09:58:54.3292975Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.3293016Z Autotune Choices Stats: 2025-12-04T09:58:54.3293761Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.3293889Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3294004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3294163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3294807Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3295423Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3296062Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3296661Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3297273Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3297877Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3298479Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3299105Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3299760Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3300365Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3300495Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.3300536Z Autotune Choices Stats: 2025-12-04T09:58:54.3301295Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.3301514Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3301677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3301950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3302583Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3303211Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3303862Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3304497Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3305126Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3305757Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3306396Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3307024Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3307695Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3308332Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3308462Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.3308538Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3308581Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3308620Z unimplemented [] 2025-12-04T09:58:54.3308682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3308786Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3309363Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3309404Z graph_break [] 2025-12-04T09:58:54.3309480Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3309521Z Autotune Choices Stats: 
2025-12-04T09:58:54.3310259Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.3310389Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3310504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3310669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3311283Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3311914Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3312526Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3313133Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3313749Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3314353Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3314954Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3315564Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3316258Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3316873Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3317002Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.3317045Z Autotune Choices Stats: 2025-12-04T09:58:54.3317818Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3318035Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3318203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3318483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3319121Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3319759Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3320416Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3321051Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3321686Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3322316Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3322942Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3323575Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3324203Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3324859Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3324994Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.3325072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3325114Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3325154Z unimplemented [] 2025-12-04T09:58:54.3325215Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3325316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3325900Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3325977Z graph_break [] 2025-12-04T09:58:54.3326054Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3326094Z Autotune Choices Stats: 2025-12-04T09:58:54.3326833Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.3326962Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3327079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3327242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3327857Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3328465Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3329112Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3329735Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3330342Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3330944Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3331553Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3332162Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3332765Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3333421Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3333561Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.3333602Z Autotune Choices Stats: 2025-12-04T09:58:54.3334363Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.3334580Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.3334747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3335025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3335665Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3336332Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3336954Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3337621Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3338262Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3338890Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3339515Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3340145Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3340776Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3341403Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.3341543Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices
2025-12-04T09:58:54.3341636Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.3341685Z Traceback (most recent call last):
2025-12-04T09:58:54.3341857Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.3341909Z self.assertTrue(
2025-12-04T09:58:54.3342015Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.3342066Z raise self.failureException(msg)
2025-12-04T09:58:54.3342195Z AssertionError: False is not true : Log file /tmp/tmprdkgqj0a/flex_attention_configs.json was not created
2025-12-04T09:58:54.3342200Z
2025-12-04T09:58:54.3342279Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.3342445Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.3342448Z
2025-12-04T09:58:54.3342540Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.3342616Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.3342660Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.3342701Z unimplemented
[] 2025-12-04T09:58:54.3342765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3343348Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.3343451Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3343493Z graph_break [] 2025-12-04T09:58:54.3343567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3344063Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.3344113Z current_size = base.storage().size() 2025-12-04T09:58:54.3344156Z Autotune Choices Stats: 2025-12-04T09:58:54.3344899Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.3345031Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3345149Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3345321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3345993Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3346615Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3347223Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3347824Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3348428Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3349026Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3349636Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3350271Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3350878Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3351481Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3351616Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.3351660Z Autotune Choices Stats: 2025-12-04T09:58:54.3352417Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.3352639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3352804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3353082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3353714Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3354337Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3354996Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3355627Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3356295Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3356919Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3357546Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3358175Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3359067Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3359698Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3359832Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.3359911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3359953Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3359994Z unimplemented [] 2025-12-04T09:58:54.3360058Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3360158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3360735Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3360775Z graph_break [] 2025-12-04T09:58:54.3360850Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.3360894Z Autotune Choices Stats: 2025-12-04T09:58:54.3361636Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.3361768Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3361886Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3362050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3362663Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3363311Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3363926Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3364528Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3365132Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3365737Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3366398Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3367002Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3367647Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3368265Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3368396Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.3368440Z Autotune Choices Stats: 2025-12-04T09:58:54.3369191Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.3369409Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3369578Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3369861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3370498Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3371123Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3371881Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3372513Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3373143Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3373768Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3374394Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3375015Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3375643Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3376347Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3376487Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.3376564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3376606Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3376647Z unimplemented [] 2025-12-04T09:58:54.3376711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3376815Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3377393Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3377435Z graph_break [] 2025-12-04T09:58:54.3377509Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3377551Z Autotune Choices Stats: 2025-12-04T09:58:54.3378296Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.3378429Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3378546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3378706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3379324Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3379930Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3380566Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3381177Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3381783Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.3382394Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3382997Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3383601Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3384209Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3384841Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3384979Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.3385023Z Autotune Choices Stats: 2025-12-04T09:58:54.3385775Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.3386023Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.3386195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3386474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3387111Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3387742Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3388368Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3389031Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3389673Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3390307Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3390929Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3391556Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3392185Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3392815Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3392956Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.3393033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3393076Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3393115Z unimplemented [] 2025-12-04T09:58:54.3393199Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3393310Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3393886Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3393925Z graph_break [] 2025-12-04T09:58:54.3394001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3394041Z Autotune Choices Stats: 2025-12-04T09:58:54.3394786Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.3394915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3395031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3395193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3395805Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3396437Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3397042Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3397685Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3398311Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3398918Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3399521Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3400126Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3400736Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3401342Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3401486Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.3401530Z Autotune Choices Stats: 2025-12-04T09:58:54.3402306Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3402535Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3402707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3402986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3403617Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3404245Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3404880Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3405500Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3406202Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3406844Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3407471Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3408096Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3408729Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3409357Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3409489Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.3409564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3409609Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3409647Z unimplemented [] 2025-12-04T09:58:54.3409736Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3409836Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3410434Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3410492Z graph_break [] 2025-12-04T09:58:54.3410567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3410608Z Autotune Choices Stats: 2025-12-04T09:58:54.3411366Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.3411497Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3411611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3411774Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3412390Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3413000Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.3413605Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3414207Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3414844Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3415456Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3416093Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3416696Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3417305Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3417909Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3418041Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.3418082Z Autotune Choices Stats: 2025-12-04T09:58:54.3418839Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.3419112Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3419280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3419562Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3420192Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3420824Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3421447Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3422073Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3422699Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3423352Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3423987Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3424615Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3425240Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3425866Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3426029Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.3426106Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3426153Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3426191Z unimplemented [] 2025-12-04T09:58:54.3426257Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3426359Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3426932Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3426992Z graph_break [] 2025-12-04T09:58:54.3427070Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3427111Z Autotune Choices Stats: 2025-12-04T09:58:54.3427881Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.3428025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3428140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3428307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3428916Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3429525Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3430128Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3430745Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3431353Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3431990Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3432603Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3433207Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3433809Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3434412Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3434546Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.3434589Z Autotune Choices Stats: 2025-12-04T09:58:54.3435363Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.3435596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3435760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3436105Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3436744Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3437376Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3438003Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3438643Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3439276Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3439906Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3440571Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3441212Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3441839Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3442469Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3442602Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.3442678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3442726Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3442764Z unimplemented [] 2025-12-04T09:58:54.3442829Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3442929Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3443507Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3443549Z graph_break [] 2025-12-04T09:58:54.3443623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3443668Z Autotune Choices Stats: 2025-12-04T09:58:54.3444430Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.3444584Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3444700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3444864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3445478Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3446138Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3446744Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3447344Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3447951Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3448557Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3449210Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3449826Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3450427Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3451040Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3451173Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.3451215Z Autotune Choices Stats: 2025-12-04T09:58:54.3451978Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.3452203Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3452368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3452665Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3453313Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3453946Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3454575Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3455209Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3455838Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3456506Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3457134Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3457803Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3458443Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3459068Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3459200Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.3459274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3459322Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3459359Z unimplemented [] 2025-12-04T09:58:54.3459425Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3459528Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3460107Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3460149Z graph_break [] 2025-12-04T09:58:54.3460224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3460270Z Autotune Choices Stats: 2025-12-04T09:58:54.3461012Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.3461160Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3461278Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3461441Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3462087Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3462692Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3463300Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3463908Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3464512Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3465118Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3465733Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3466403Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3467017Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3467625Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3467759Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.3467802Z Autotune Choices Stats: 2025-12-04T09:58:54.3468550Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.3468770Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3468932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3469217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.3469844Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3470499Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3471137Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3471763Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3472392Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3473018Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3473653Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3474287Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3474943Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3475587Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3475724Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.3475805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3475849Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3475893Z unimplemented [] 2025-12-04T09:58:54.3475994Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3476098Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3476674Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3476714Z graph_break [] 2025-12-04T09:58:54.3476788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3476835Z Autotune Choices Stats: 2025-12-04T09:58:54.3477584Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.3477716Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3477835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3477997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3478657Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3479271Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3479877Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3480479Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3481087Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3481697Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3482301Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3482916Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3483550Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3484160Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3484290Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.3484337Z Autotune Choices Stats: 2025-12-04T09:58:54.3485098Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3485322Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3485489Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3485766Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3486430Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3487056Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3487712Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3488351Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3488979Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3489611Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3490242Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3490871Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3491496Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3492164Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3492292Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.3492369Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3492410Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3492450Z unimplemented [] 2025-12-04T09:58:54.3492510Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3492615Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3493192Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3493233Z graph_break [] 2025-12-04T09:58:54.3493307Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3493349Z Autotune Choices Stats: 2025-12-04T09:58:54.3494094Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.3494223Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.3494340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3494500Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3495115Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3496266Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3496879Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3497486Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3498092Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3498699Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3499304Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3499907Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3500543Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3501159Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3501292Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.3501337Z Autotune Choices Stats: 2025-12-04T09:58:54.3502087Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.3502308Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3502474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3502754Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3503394Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3504032Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3504660Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3505318Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3505982Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3506611Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3507238Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3507866Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3508491Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3509163Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3509305Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.3509385Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.3509429Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3509472Z unimplemented [] 2025-12-04T09:58:54.3509533Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3509640Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3510213Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3510253Z graph_break [] 2025-12-04T09:58:54.3510331Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3510372Z Autotune Choices Stats: 2025-12-04T09:58:54.3511121Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.3511251Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3511368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3511531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3512143Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3512752Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3513391Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3513998Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3514601Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3515204Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3515812Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3516439Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3517047Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3517697Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3517839Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.3517883Z Autotune Choices Stats: 2025-12-04T09:58:54.3518637Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.3518856Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3519024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3519306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3519946Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3520571Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3521199Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3521848Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3522486Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3523115Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3523824Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3524454Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3525082Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3525709Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3525863Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.3525980Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3526026Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.3526063Z unimplemented [] 2025-12-04T09:58:54.3526129Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3526277Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3526854Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3526893Z graph_break [] 2025-12-04T09:58:54.3526971Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3527014Z Autotune Choices Stats: 2025-12-04T09:58:54.3527753Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.3527887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3528003Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3528166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3528778Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3529384Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3529995Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3530637Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3531247Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3531858Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3532462Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3533065Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3533674Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3534284Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3534430Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.3534471Z Autotune Choices Stats: 2025-12-04T09:58:54.3535246Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.3535473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3535642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3535961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3536591Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3537219Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3537846Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3538475Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3539179Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3539830Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3540459Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3541092Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3541718Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3542353Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3542488Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.3542563Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3542610Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3542649Z unimplemented [] 
2025-12-04T09:58:54.3542715Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3542830Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3543423Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3543471Z graph_break [] 2025-12-04T09:58:54.3543549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3543590Z Autotune Choices Stats: 2025-12-04T09:58:54.3544331Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.3544467Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3544581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3544752Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3545362Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3546016Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3546620Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3547228Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3547877Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3548500Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3549108Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3549717Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3550325Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3550934Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3551068Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.3551110Z Autotune Choices Stats: 2025-12-04T09:58:54.3551863Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.3552119Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3552294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3552576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3553215Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3553840Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3554468Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3555099Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3555724Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3556446Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3557094Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3557726Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3558357Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3558985Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3559118Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.3559194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3559239Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3559277Z unimplemented [] 2025-12-04T09:58:54.3559344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.3559447Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3560030Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3560088Z graph_break [] 2025-12-04T09:58:54.3560164Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3560211Z Autotune Choices Stats: 2025-12-04T09:58:54.3560968Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.3561111Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3561227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3561389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3562007Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3562614Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3563222Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3563861Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3564477Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3565110Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3565724Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3566368Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3566972Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3567574Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3567710Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.3567752Z Autotune Choices Stats: 
2025-12-04T09:58:54.3568516Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.3568750Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3568915Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3569220Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3569863Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3570493Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3571119Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3571743Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3572378Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3573005Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3573668Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3574311Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3574945Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3575570Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3575703Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.3575781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3575826Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3575865Z unimplemented [] 2025-12-04T09:58:54.3575969Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3576072Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3576654Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3576697Z graph_break [] 2025-12-04T09:58:54.3576773Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3576819Z Autotune Choices Stats: 2025-12-04T09:58:54.3577549Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.3577734Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3577857Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3578023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.3578644Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3579251Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3579859Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3580472Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3581088Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3581691Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3582326Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3582945Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3583552Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3584157Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3584290Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.3584336Z Autotune Choices Stats: 2025-12-04T09:58:54.3585100Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.3585322Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3585489Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3585781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3586494Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3587134Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3587762Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3588392Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3589026Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3589653Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3590279Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3590952Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3591608Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3592234Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3592367Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.3592449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3592493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3592538Z unimplemented [] 2025-12-04T09:58:54.3592602Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3592710Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3593295Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3593338Z graph_break [] 2025-12-04T09:58:54.3593412Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3593458Z Autotune Choices Stats: 2025-12-04T09:58:54.3594195Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.3594343Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3594466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3594629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3595268Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3595883Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3596533Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3597133Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3597737Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3598347Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3598954Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3599602Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3600220Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3600827Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3600958Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.3601005Z Autotune Choices Stats: 2025-12-04T09:58:54.3601765Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.3601988Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3602157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3602433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3603077Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3603736Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3604373Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3605003Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3605636Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3606308Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3606938Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3607572Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3608235Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3608878Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3609009Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.3609089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3609132Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3609175Z unimplemented [] 2025-12-04T09:58:54.3609237Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3609341Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3609915Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3609960Z graph_break [] 2025-12-04T09:58:54.3610039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3610083Z Autotune Choices Stats: 2025-12-04T09:58:54.3610829Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.3610962Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3611080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3611243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3611890Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3612504Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3613112Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3613713Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3614320Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3614932Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3615537Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3616181Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3616829Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3617449Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3617578Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.3617624Z Autotune Choices Stats: 2025-12-04T09:58:54.3618379Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.3618603Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3618775Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3619059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3619696Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3620326Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3620988Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3621625Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3622256Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3622885Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3623520Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3624144Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3624773Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3625433Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3625579Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.3625659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3625703Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3625748Z unimplemented [] 2025-12-04T09:58:54.3625811Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3625919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3626535Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3626574Z graph_break [] 2025-12-04T09:58:54.3626654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3626696Z Autotune Choices Stats: 2025-12-04T09:58:54.3627446Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.3627575Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3627694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3627860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3628473Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3629127Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3629738Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3630345Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3630948Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3631560Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3632170Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3632774Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3633392Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3634026Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3634156Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.3634201Z Autotune Choices Stats: 2025-12-04T09:58:54.3634964Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.3635183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3635357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3635642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3636307Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3636934Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3637562Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3638238Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3638882Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3639515Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3640145Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3640770Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3641397Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3642055Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3642199Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.3642275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3642323Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3642364Z unimplemented [] 2025-12-04T09:58:54.3642430Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3642531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3643110Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3643151Z graph_break [] 2025-12-04T09:58:54.3643231Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3643273Z Autotune Choices Stats: 2025-12-04T09:58:54.3644016Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.3644156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3644271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3644437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3645050Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3645662Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3646343Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3646966Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.3647574Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3648175Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3648779Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3649389Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3649990Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3650635Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3650778Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.3650821Z Autotune Choices Stats: 2025-12-04T09:58:54.3651576Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.3651795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3651965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3652245Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3652886Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3653516Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3654146Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3654813Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3655450Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3656112Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3656736Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3657371Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3657995Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3658623Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3658776Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.3658853Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3658902Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3658940Z unimplemented [] 2025-12-04T09:58:54.3659006Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3659132Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3659726Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3659767Z graph_break [] 2025-12-04T09:58:54.3659848Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3659890Z Autotune Choices Stats: 2025-12-04T09:58:54.3660633Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.3660766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3660880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3661045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3661658Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3662270Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3662875Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3663511Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3664121Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3664730Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3665341Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3665986Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3666586Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3667197Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3667352Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.3667396Z Autotune Choices Stats: 2025-12-04T09:58:54.3668188Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.3668429Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3668594Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3668878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3669514Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3670147Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3670771Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3671402Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3672065Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3672704Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3673338Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3673972Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3674600Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3675236Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3675371Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.3675446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3675495Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3675533Z unimplemented [] 2025-12-04T09:58:54.3675600Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3675712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3676361Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3676418Z graph_break [] 2025-12-04T09:58:54.3676493Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3676539Z Autotune Choices Stats: 2025-12-04T09:58:54.3677284Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.3677422Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3677538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3677703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3678317Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3678914Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3679528Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3680132Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3680775Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3681386Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3681992Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3682601Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3683205Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3683813Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3683948Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.3683989Z Autotune Choices Stats: 2025-12-04T09:58:54.3684780Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.3685038Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3685216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3685513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3686189Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3686844Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3687476Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3688104Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3688738Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3692416Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3693063Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3693694Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3694321Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3694952Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3695086Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.3695165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3695212Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3695251Z unimplemented [] 2025-12-04T09:58:54.3695317Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3695421Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3696053Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3696114Z graph_break [] 2025-12-04T09:58:54.3696190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3696234Z Autotune Choices Stats: 2025-12-04T09:58:54.3696998Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.3697144Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3697261Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3697427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3698043Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3698648Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3699248Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3699853Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3700455Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3701092Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3701706Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3702314Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3702922Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3703526Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3703658Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.3703699Z Autotune Choices Stats: 2025-12-04T09:58:54.3704455Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.3704689Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3704857Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3705166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3705805Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3706456Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3707089Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3707710Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3708343Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3708974Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3709644Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3710289Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3710922Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3711549Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3711682Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.3711759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3711802Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3711842Z unimplemented [] 2025-12-04T09:58:54.3711903Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3712009Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3712581Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3712622Z graph_break [] 
2025-12-04T09:58:54.3712697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3712739Z Autotune Choices Stats: 2025-12-04T09:58:54.3713478Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.3713647Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3713775Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3713935Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3714557Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3715164Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3715767Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3716472Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3717074Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3717679Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3718343Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3718972Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3719574Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3720177Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3720307Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.3720349Z Autotune Choices Stats: 2025-12-04T09:58:54.3721112Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.3721332Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3721499Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3721796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3722453Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3723094Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.3723718Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3724348Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3724981Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3725612Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3726283Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3726973Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3727611Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3728240Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3728371Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.3728448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3728490Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3728529Z unimplemented [] 2025-12-04T09:58:54.3728591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3728695Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3729271Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3729313Z graph_break [] 2025-12-04T09:58:54.3729388Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.3729430Z Autotune Choices Stats: 2025-12-04T09:58:54.3730175Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.3730315Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3730432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3730592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3731218Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3731836Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3732439Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3733049Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3733656Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3734263Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3734874Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3735510Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3736162Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3736771Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3736900Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.3736943Z Autotune Choices Stats: 2025-12-04T09:58:54.3737702Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.3737921Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3738088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3738365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3738998Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3739673Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3740307Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3740941Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3741571Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3742202Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3742829Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3743464Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3744120Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3744757Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3744887Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.3744962Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3745003Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3745040Z unimplemented [] 2025-12-04T09:58:54.3745100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3745201Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3745771Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3745812Z graph_break [] 2025-12-04T09:58:54.3745888Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3745965Z Autotune Choices Stats: 
2025-12-04T09:58:54.3746710Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.3746840Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3746957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3747116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3747780Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3748398Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3749004Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3749606Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3750215Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3750821Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3751426Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3752029Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3752670Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3753282Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3753413Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.3753455Z Autotune Choices Stats: 2025-12-04T09:58:54.3754213Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3754435Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3754600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3754877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3755509Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3756166Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3756847Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3757479Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3758107Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3758746Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3759370Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3760000Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3760632Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3761296Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3761425Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.3761499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3761542Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3761579Z unimplemented [] 2025-12-04T09:58:54.3761639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3761740Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3762316Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3762356Z graph_break [] 2025-12-04T09:58:54.3762432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3762471Z Autotune Choices Stats: 2025-12-04T09:58:54.3763213Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.3763342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3763454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3763614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3764224Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3764856Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3765465Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3766106Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3766715Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3767310Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3767913Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3768522Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3770102Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3770714Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3770845Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.3770884Z Autotune Choices Stats: 2025-12-04T09:58:54.3771648Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.3771864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.3772029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3772307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3772936Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3773564Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3774191Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3774852Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3775487Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3776152Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3776773Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3777405Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3778035Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3778696Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3778836Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.3778908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3778952Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3778988Z unimplemented [] 2025-12-04T09:58:54.3779050Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3779148Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3779724Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3779761Z graph_break [] 2025-12-04T09:58:54.3779835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3779874Z Autotune Choices Stats: 2025-12-04T09:58:54.3780613Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.3780742Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3780855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3781016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3781622Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3782232Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3782875Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3783486Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3784087Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3784688Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3785294Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3785896Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3786533Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3787186Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3787328Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.3787367Z Autotune Choices Stats: 2025-12-04T09:58:54.3788118Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.3788337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3788504Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3788781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3789413Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3790040Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3790669Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3791319Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3791952Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3792584Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3793206Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3793838Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3794462Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3795090Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.3795231Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices
2025-12-04T09:58:54.3795324Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.3795372Z Traceback (most recent call last):
2025-12-04T09:58:54.3795544Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.3795594Z self.assertTrue(
2025-12-04T09:58:54.3795700Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.3795750Z raise self.failureException(msg)
2025-12-04T09:58:54.3795877Z AssertionError: False is not true : Log file /tmp/tmp5gigeitq/flex_attention_configs.json was not created
2025-12-04T09:58:54.3795881Z
2025-12-04T09:58:54.3795996Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.3796161Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.3796164Z
2025-12-04T09:58:54.3796257Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.3796332Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.3796373Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.3796410Z unimplemented []
2025-12-04T09:58:54.3796473Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.3797047Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.3797147Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3797183Z graph_break [] 2025-12-04T09:58:54.3797256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3797743Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.3797791Z current_size = base.storage().size() 2025-12-04T09:58:54.3797833Z Autotune Choices Stats: 2025-12-04T09:58:54.3798580Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.3798709Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3798821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3798981Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3799640Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3800262Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3800867Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3801469Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3802078Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3802675Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3803275Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3803901Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3804510Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3805108Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3805237Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.3805276Z Autotune Choices Stats: 2025-12-04T09:58:54.3806076Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.3806299Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3806462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3806737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3807364Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3807988Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3808661Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3809292Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3809918Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3810545Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3811162Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3811787Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3812443Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3813072Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3813200Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.3813276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3813317Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3813354Z unimplemented [] 2025-12-04T09:58:54.3813416Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3813517Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3814095Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3814135Z graph_break [] 2025-12-04T09:58:54.3814208Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.3814247Z Autotune Choices Stats: 2025-12-04T09:58:54.3814981Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.3815109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3815226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3815389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3816040Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3816726Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3817339Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3817943Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3818550Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3819152Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3819758Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3820362Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3820996Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3821607Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3821737Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.3821782Z Autotune Choices Stats: 2025-12-04T09:58:54.3822550Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.3822770Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3822939Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3823220Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3823857Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3824477Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3825109Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3825759Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3826427Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3827054Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3827684Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3828317Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3828942Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3829605Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3829745Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.3829823Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3829865Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3829907Z unimplemented [] 2025-12-04T09:58:54.3829972Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3830078Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3830653Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3830692Z graph_break [] 2025-12-04T09:58:54.3830773Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3830814Z Autotune Choices Stats: 2025-12-04T09:58:54.3831557Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.3831687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3831805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3831968Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3832584Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3833190Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3833830Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3834438Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3835049Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.3835654Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3836288Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3836886Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3837495Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3838147Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3838289Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.3838330Z Autotune Choices Stats: 2025-12-04T09:58:54.3839091Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.3839311Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.3839479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3839763Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3840395Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3841022Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3841647Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3842305Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3842945Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3843575Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3844205Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3844834Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3845466Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3846140Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3846288Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.3846363Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3846409Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3846449Z unimplemented [] 2025-12-04T09:58:54.3846514Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3846654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3847231Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3847271Z graph_break [] 2025-12-04T09:58:54.3847353Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3847395Z Autotune Choices Stats: 2025-12-04T09:58:54.3848146Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.3848276Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3848389Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3848553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3849165Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3849775Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3850383Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3851036Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3851651Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3852256Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3852856Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3853458Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3854057Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3854666Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3854808Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.3854848Z Autotune Choices Stats: 2025-12-04T09:58:54.3855623Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3855850Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3856055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3856341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3856969Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3857597Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3858228Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3858858Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3859524Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3860165Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3860798Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3861427Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3862054Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3862691Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3862822Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.3862895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3862942Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3862980Z unimplemented [] 2025-12-04T09:58:54.3863045Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3863171Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3863781Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3863837Z graph_break [] 2025-12-04T09:58:54.3863914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3863955Z Autotune Choices Stats: 2025-12-04T09:58:54.3864695Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.3864828Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3864942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3865109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3865725Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3866369Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.3866976Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3867577Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3868231Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3868849Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3869460Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3870063Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3870667Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3871269Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3871402Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.3871442Z Autotune Choices Stats: 2025-12-04T09:58:54.3872200Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.3872448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3872627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3872906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3873541Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3874163Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3874792Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3875421Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3876088Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3876762Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3877398Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3878019Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3878649Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3879270Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3879401Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.3879474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3879517Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3879555Z unimplemented [] 2025-12-04T09:58:54.3879617Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3879719Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3880295Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3880350Z graph_break [] 2025-12-04T09:58:54.3880424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3880465Z Autotune Choices Stats: 2025-12-04T09:58:54.3881228Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.3881368Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3881482Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3881646Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3882262Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3882866Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3883477Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3884083Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3884691Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3885317Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3885964Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3886570Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3887180Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3887780Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3887913Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.3887953Z Autotune Choices Stats: 2025-12-04T09:58:54.3888716Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.3888958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3889123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3889427Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3890072Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3890703Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3891330Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3891958Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3892591Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3893216Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3893869Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3894507Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3895132Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3895755Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3895888Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.3896007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3896050Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3896092Z unimplemented [] 2025-12-04T09:58:54.3896155Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3896259Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3896840Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3896882Z graph_break [] 2025-12-04T09:58:54.3896957Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3897003Z Autotune Choices Stats: 2025-12-04T09:58:54.3897738Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.3897929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3898046Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3898207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3898829Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3899432Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3900033Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3900640Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3901249Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3901857Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3902482Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3903101Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3903708Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3904310Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3904445Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.3904489Z Autotune Choices Stats: 2025-12-04T09:58:54.3905247Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.3905472Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3905641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3905971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3906636Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3907276Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3907901Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3908527Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3909155Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3909780Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3910398Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3911063Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3911703Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3912328Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3912462Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.3912540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3912583Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3912625Z unimplemented [] 2025-12-04T09:58:54.3912687Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3912791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3913367Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3913410Z graph_break [] 2025-12-04T09:58:54.3913485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3913529Z Autotune Choices Stats: 2025-12-04T09:58:54.3914272Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.3914420Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3914539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3914700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3915340Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3915998Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3916598Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3917208Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3917812Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3918416Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3919017Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3919654Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3920265Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3920870Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3920998Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.3921044Z Autotune Choices Stats: 2025-12-04T09:58:54.3921802Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.3922023Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3922194Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3922471Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.3923109Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3923763Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3924392Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3925016Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3925648Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3926316Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3926932Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3927568Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3928236Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3928883Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3929012Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.3929090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3929135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3929177Z unimplemented [] 2025-12-04T09:58:54.3929238Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3929345Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3929919Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3929961Z graph_break [] 2025-12-04T09:58:54.3930038Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3930080Z Autotune Choices Stats: 2025-12-04T09:58:54.3930820Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.3930951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3931071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3931234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3931872Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3932482Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3933089Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3933690Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3934294Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3934904Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3935508Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3936156Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3936805Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3937419Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3937548Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.3937590Z Autotune Choices Stats: 2025-12-04T09:58:54.3938349Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.3938568Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3938738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3939012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3939648Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3940272Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3940929Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3941559Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3942187Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3942819Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3943450Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3944070Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3944702Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3945369Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3945509Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.3945586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3945634Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3945671Z unimplemented [] 2025-12-04T09:58:54.3945734Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3945838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3946456Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3946499Z graph_break [] 2025-12-04T09:58:54.3946579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3946621Z Autotune Choices Stats: 2025-12-04T09:58:54.3947369Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.3947500Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.3947616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3947783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3948394Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3949046Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3949660Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3950264Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3950870Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3951476Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3952086Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3952686Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3953301Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3953935Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3954070Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.3954111Z Autotune Choices Stats: 2025-12-04T09:58:54.3954870Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.3955092Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3955260Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3955543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3956217Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3956847Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3957472Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3958141Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3958781Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3959410Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3960040Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.3960665Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3961296Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3961961Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3962105Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.3962180Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.3962228Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3962269Z unimplemented [] 2025-12-04T09:58:54.3962332Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3962433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3963012Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3963053Z graph_break [] 2025-12-04T09:58:54.3963130Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3963172Z Autotune Choices Stats: 2025-12-04T09:58:54.3963906Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.3964042Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3964156Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3964319Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3964932Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3965534Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3966215Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.3966823Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3967431Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3968037Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3968643Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3969246Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3969849Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3970482Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3970623Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.3970665Z Autotune Choices Stats: 2025-12-04T09:58:54.3971432Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.3971655Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3971826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3972109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3972740Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3973367Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3973999Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3974651Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3975291Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3975967Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3976590Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3977220Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3977851Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3978479Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3978634Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.3978709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3978754Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.3978793Z unimplemented [] 2025-12-04T09:58:54.3978863Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3978989Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3979583Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.3979625Z graph_break [] 2025-12-04T09:58:54.3979701Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3979747Z Autotune Choices Stats: 2025-12-04T09:58:54.3980485Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.3980614Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3980729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3980892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3981511Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3982117Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3982723Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3983355Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3983969Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3984577Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3985184Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3985787Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3986420Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3987021Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3987167Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.3987208Z Autotune Choices Stats: 2025-12-04T09:58:54.3987987Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.3988220Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3988384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3988669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3989296Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3989919Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3990546Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3991173Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.3991838Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3992478Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3993103Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3993733Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3994366Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3994992Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3995124Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.3995199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.3995245Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.3995283Z unimplemented [] 
2025-12-04T09:58:54.3995349Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.3995460Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.3996120Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.3996173Z graph_break [] 2025-12-04T09:58:54.3996247Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.3996292Z Autotune Choices Stats: 2025-12-04T09:58:54.3997032Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.3997165Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.3997284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.3997447Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.3998069Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.3998672Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3999278Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.3999885Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4000524Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4001138Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4001748Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4002358Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4002961Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4003567Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4003700Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.4003743Z Autotune Choices Stats: 2025-12-04T09:58:54.4004504Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4004756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4004929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4005209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4005837Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4006497Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4007122Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4007742Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4008374Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4009046Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4009675Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4010308Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4010934Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4011556Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4011691Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.4011769Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4011812Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4011854Z unimplemented [] 2025-12-04T09:58:54.4011916Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.4012019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4012600Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4012653Z graph_break [] 2025-12-04T09:58:54.4012730Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4012775Z Autotune Choices Stats: 2025-12-04T09:58:54.4013535Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.4013677Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4013794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4013953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4014576Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4015182Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4015785Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4016428Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4017027Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4017670Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4018279Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4018887Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4019488Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4020090Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4020223Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.4020266Z Autotune Choices Stats: 
2025-12-04T09:58:54.4021015Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.4021246Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4021414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4021711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4022353Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4022971Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4023594Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4024220Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4024854Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4025483Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4026203Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4026850Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4027482Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4028108Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4028238Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.4028317Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4028361Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4028403Z unimplemented [] 2025-12-04T09:58:54.4028464Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4028570Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4029143Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4029184Z graph_break [] 2025-12-04T09:58:54.4029259Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4029302Z Autotune Choices Stats: 2025-12-04T09:58:54.4030033Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.4030195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4030326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4030486Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.4031098Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4031702Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4032309Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4032916Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4033523Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4034124Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4034758Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4035371Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4036008Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4036616Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4036747Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.4036790Z Autotune Choices Stats: 2025-12-04T09:58:54.4037544Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4037766Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4037933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4038230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4038888Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4039525Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4040155Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4040782Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4041414Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4042043Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4042669Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4043336Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4043986Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4044617Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4044748Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.4044828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4044871Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4044914Z unimplemented [] 2025-12-04T09:58:54.4044975Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4045079Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4045653Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4045696Z graph_break [] 2025-12-04T09:58:54.4045774Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4045817Z Autotune Choices Stats: 2025-12-04T09:58:54.4046588Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.4046728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4046844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4047007Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4047641Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4048257Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4048865Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4049465Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4050075Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4050681Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4051287Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4051916Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4052528Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4053131Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4053261Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.4053304Z Autotune Choices Stats: 2025-12-04T09:58:54.4054073Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4054294Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4054460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4054744Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4055378Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4056090Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4056732Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4057355Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4057984Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4058618Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4059244Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4059872Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4060533Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4061165Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4061299Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.4061375Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4061421Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4061460Z unimplemented [] 2025-12-04T09:58:54.4061525Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4061626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4062211Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4062250Z graph_break [] 2025-12-04T09:58:54.4062325Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4062366Z Autotune Choices Stats: 2025-12-04T09:58:54.4063106Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.4063238Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4063353Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4063516Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4064143Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4064785Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4065388Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4066022Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4066632Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4067238Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4067845Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4068451Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4069089Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4069705Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4069839Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.4069880Z Autotune Choices Stats: 2025-12-04T09:58:54.4070649Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.4070869Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4071036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4071317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4071948Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4072580Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4073241Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4073876Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4074507Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4075137Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4075765Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4076430Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4077052Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4077723Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4077869Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.4077945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4077994Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4078032Z unimplemented [] 2025-12-04T09:58:54.4078099Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4078202Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4078788Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4078828Z graph_break [] 2025-12-04T09:58:54.4078906Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4078949Z Autotune Choices Stats: 2025-12-04T09:58:54.4079697Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.4079829Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4079945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4080113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4080727Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4081362Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4081971Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4082583Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4083184Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4083790Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4084397Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4085004Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4085604Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4086293Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4086438Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.4086480Z Autotune Choices Stats: 2025-12-04T09:58:54.4087241Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4087462Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4087626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4087907Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4088543Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4089177Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4089808Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4090480Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4091124Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4091754Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4092381Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4093014Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.4093642Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4094270Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4094451Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.4094528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4094575Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4094615Z unimplemented [] 2025-12-04T09:58:54.4094681Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4094782Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4095357Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4095400Z graph_break [] 2025-12-04T09:58:54.4095475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4095519Z Autotune Choices Stats: 2025-12-04T09:58:54.4096289Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.4096423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4096539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4096708Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4097323Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4097925Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4098589Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4099208Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.4099814Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4100414Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4101023Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4101631Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4102233Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4102857Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4103021Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.4103065Z Autotune Choices Stats: 2025-12-04T09:58:54.4103828Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4104051Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4104218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4104498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4105143Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4105767Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4106434Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4107063Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4107756Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4108383Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4109013Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4109649Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4110278Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4110902Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4111046Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.4111123Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4111172Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4111211Z unimplemented [] 2025-12-04T09:58:54.4111279Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4111401Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4111987Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4112033Z graph_break [] 2025-12-04T09:58:54.4112107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4112154Z Autotune Choices Stats: 2025-12-04T09:58:54.4112907Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.4113040Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4113158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4113318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4113934Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4114542Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4115149Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4115780Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4116415Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4117023Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4117628Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4118234Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4118842Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4119445Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4119598Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.4119643Z Autotune Choices Stats: 2025-12-04T09:58:54.4120417Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.4120657Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4120824Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4121105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4121742Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4122368Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4123003Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4123626Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4124291Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4124939Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4125567Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4126229Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4126861Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4127482Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4127618Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.4127699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4127743Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4127787Z unimplemented [] 2025-12-04T09:58:54.4127849Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4127953Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4128570Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4128625Z graph_break [] 2025-12-04T09:58:54.4128700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4128747Z Autotune Choices Stats: 2025-12-04T09:58:54.4129483Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.4129620Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4129737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4129897Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4130509Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4131113Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4131718Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4132327Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4132960Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4133571Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4134174Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4134779Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4135383Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4136028Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4136161Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.4136207Z Autotune Choices Stats: 2025-12-04T09:58:54.4136973Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.4137239Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4137421Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4137696Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4138340Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4138968Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4139591Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4140214Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4140860Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4141520Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4142156Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4142787Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4143415Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4144044Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4144174Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.4144252Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4144297Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4144340Z unimplemented [] 2025-12-04T09:58:54.4144406Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4144510Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4145088Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4145141Z graph_break [] 2025-12-04T09:58:54.4145216Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4145264Z Autotune Choices Stats: 2025-12-04T09:58:54.4146060Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.4146202Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4146323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4146485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4147102Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4147712Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4148320Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4148933Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4149537Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4150176Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4150794Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4151402Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4152009Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4152621Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4152756Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.4152800Z Autotune Choices Stats: 2025-12-04T09:58:54.4153561Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.4153783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4153961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4154266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4154918Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4155548Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4156192Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4156823Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4157457Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4158089Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4158764Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4159400Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4160035Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4160666Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4160797Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.4160876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4160919Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4160964Z unimplemented [] 2025-12-04T09:58:54.4161027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4161133Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4161712Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4161752Z graph_break [] 
2025-12-04T09:58:54.4161828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4161870Z Autotune Choices Stats: 2025-12-04T09:58:54.4162615Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.4162777Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4162905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4163071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4163683Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4164296Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4164908Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4165513Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4166157Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4166771Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4167423Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4168037Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4168652Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4169259Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4169391Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.4169438Z Autotune Choices Stats: 2025-12-04T09:58:54.4170204Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.4170423Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4170593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4170870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4171551Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4172194Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.4172822Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4173446Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4174081Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4174711Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4175340Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4176058Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4176702Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4177332Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4177467Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.4177541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4177587Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4177624Z unimplemented [] 2025-12-04T09:58:54.4177687Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4177786Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4178366Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4178405Z graph_break [] 2025-12-04T09:58:54.4178480Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.4178522Z Autotune Choices Stats: 2025-12-04T09:58:54.4179261Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.4179411Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4179524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4179688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4180317Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4180929Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4181535Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4182143Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4182747Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4183354Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4183959Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4184593Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4185210Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4185818Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4185987Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.4186028Z Autotune Choices Stats: 2025-12-04T09:58:54.4186791Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.4187006Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4187174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4187456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4188088Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4188754Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4189388Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4190016Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4190645Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4191274Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4191902Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4192532Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4193186Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4193826Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4193960Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.4194034Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4194078Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4194115Z unimplemented [] 2025-12-04T09:58:54.4194178Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4194280Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4194850Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4194888Z graph_break [] 2025-12-04T09:58:54.4194965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4195005Z Autotune Choices Stats: 
2025-12-04T09:58:54.4195753Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.4195884Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4196041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4196205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4196834Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4197488Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4198095Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4200678Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4201300Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4201909Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4202514Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4203121Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4203775Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4204392Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4204524Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.4204564Z Autotune Choices Stats: 2025-12-04T09:58:54.4205327Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.4205547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4205711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4206026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4206658Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4207288Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4207955Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4208594Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4209220Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4209854Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4210475Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4211102Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4211737Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4212392Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4212530Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.4212606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4212649Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4212687Z unimplemented [] 2025-12-04T09:58:54.4212748Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4212851Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4213432Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4213472Z graph_break [] 2025-12-04T09:58:54.4213546Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4213587Z Autotune Choices Stats: 2025-12-04T09:58:54.4214323Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.4214453Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4214565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4214725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4215340Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4216017Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4216630Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4217235Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4217840Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.4218443Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4219045Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4219653Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4220266Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4220905Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4221035Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.4221074Z Autotune Choices Stats: 2025-12-04T09:58:54.4221834Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.4222053Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.4222216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4222495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4223124Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4223756Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4224382Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4225033Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4225675Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4226334Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4226955Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4227583Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4228211Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4228874Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4229017Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.4229094Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4229135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4229173Z unimplemented [] 2025-12-04T09:58:54.4229235Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4229338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4229919Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4229956Z graph_break [] 2025-12-04T09:58:54.4230029Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4230070Z Autotune Choices Stats: 2025-12-04T09:58:54.4230804Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.4230936Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4231050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4231209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4231820Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4232426Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4233063Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4233679Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4234282Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4234883Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4235487Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4236113Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4236715Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4237371Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4237511Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.4237552Z Autotune Choices Stats: 2025-12-04T09:58:54.4238307Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.4238527Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4238690Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4238971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4239604Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4240227Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4240849Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4241502Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4242140Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4242765Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4243385Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4244017Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4244642Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4245268Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4245408Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.4245484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4245524Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4245562Z unimplemented [] 2025-12-04T09:58:54.4245621Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4245741Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4246350Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4246391Z graph_break [] 2025-12-04T09:58:54.4246463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4246503Z Autotune Choices Stats: 2025-12-04T09:58:54.4247245Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.4247371Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4247485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4247647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4248256Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4248862Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4249462Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4250114Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4250732Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4251335Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4251942Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4252547Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4253158Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4253764Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4253907Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.4253948Z Autotune Choices Stats: 2025-12-04T09:58:54.4254732Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.4254960Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4255124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4255405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4256080Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4256705Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4257327Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4257958Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4258628Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4259263Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4259893Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4260522Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4261149Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4261777Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4261904Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.4261996Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.4262043Z Traceback (most recent call last): 2025-12-04T09:58:54.4262196Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.4262246Z self.assertTrue( 2025-12-04T09:58:54.4262352Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.4262401Z raise self.failureException(msg) 2025-12-04T09:58:54.4262530Z AssertionError: False is not true : Log file /tmp/tmpoqq18k0p/flex_attention_configs.json was not created 2025-12-04T09:58:54.4262533Z 2025-12-04T09:58:54.4262607Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.4262801Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.4262804Z 2025-12-04T09:58:54.4262893Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.4262967Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4263010Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4263048Z unimplemented [] 2025-12-04T09:58:54.4263110Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4263692Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.4263791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4263826Z graph_break [] 2025-12-04T09:58:54.4263902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4264387Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.4264436Z current_size = base.storage().size() 2025-12-04T09:58:54.4264476Z Autotune Choices Stats: 2025-12-04T09:58:54.4265216Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.4265348Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4265460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4265622Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4266273Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4266918Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4267534Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4268136Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4268735Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4269340Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4269938Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4270535Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4271164Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4271775Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4271907Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.4271946Z Autotune Choices Stats: 2025-12-04T09:58:54.4272699Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.4272919Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4273082Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4273362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4273990Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4274617Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4275245Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4275898Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4276563Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4277188Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4277810Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4278434Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4279057Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4279725Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4279868Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.4279942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4279985Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4280022Z unimplemented [] 2025-12-04T09:58:54.4280086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4280183Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4280762Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4280799Z graph_break [] 2025-12-04T09:58:54.4280871Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.4280910Z Autotune Choices Stats: 2025-12-04T09:58:54.4281653Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.4281781Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4281893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4282054Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4282666Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4283266Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4283902Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4284510Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4285113Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4285716Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4286355Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4286953Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4287553Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4288197Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4288336Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.4288375Z Autotune Choices Stats: 2025-12-04T09:58:54.4289127Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4289346Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4289511Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4289786Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4290416Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4291036Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4291662Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4292310Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4292942Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4293564Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4294183Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4294808Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4295428Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4296097Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4296239Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.4296313Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4296356Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4296393Z unimplemented [] 2025-12-04T09:58:54.4296454Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4296578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4297165Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4297204Z graph_break [] 2025-12-04T09:58:54.4297276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4297317Z Autotune Choices Stats: 2025-12-04T09:58:54.4298055Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.4298183Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4298297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4298455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4299061Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4299673Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4300276Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4300914Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4301524Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.4302129Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4302729Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4303327Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4303929Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4304528Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4304669Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.4304708Z Autotune Choices Stats: 2025-12-04T09:58:54.4305483Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.4305709Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.4305872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4306186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4306821Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4307441Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4308065Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4308687Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4310879Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4311519Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4312143Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4312770Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4313403Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4314027Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4314156Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.4314231Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4314271Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4314308Z unimplemented [] 2025-12-04T09:58:54.4314368Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4314480Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4315070Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4315121Z graph_break [] 2025-12-04T09:58:54.4315194Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4315234Z Autotune Choices Stats: 2025-12-04T09:58:54.4316005Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.4316135Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4316249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4316407Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4317022Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4317625Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4318229Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4318827Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4319471Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4320076Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4320676Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4321284Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4321883Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4322484Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4322612Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.4322653Z Autotune Choices Stats: 2025-12-04T09:58:54.4323409Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.4323652Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4323825Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4324101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4324724Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4325353Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4326009Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4326630Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4327261Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4327926Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4328557Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4329185Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4329815Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4330437Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4330566Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.4330645Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4330689Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4330730Z unimplemented [] 2025-12-04T09:58:54.4330791Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4330894Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4331478Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4331527Z graph_break [] 2025-12-04T09:58:54.4331601Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4331644Z Autotune Choices Stats: 2025-12-04T09:58:54.4332410Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.4332548Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4332665Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4332823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4333441Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4334044Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.4334645Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4335249Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4335852Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4336525Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4337143Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4337750Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4338352Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4338955Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4339084Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.4339129Z Autotune Choices Stats: 2025-12-04T09:58:54.4339890Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.4340119Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4340284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4340579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4341225Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4341852Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4342473Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4343136Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4343767Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4344396Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4345048Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4345677Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4346341Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4346965Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4347094Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.4347171Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4347214Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4347251Z unimplemented [] 2025-12-04T09:58:54.4347312Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4347414Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4347996Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4348032Z graph_break [] 2025-12-04T09:58:54.4348108Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4348148Z Autotune Choices Stats: 2025-12-04T09:58:54.4348890Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.4349087Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4349204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4349367Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4349977Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4350580Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4351190Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4351795Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4352395Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4353001Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4353629Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4354238Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4354842Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4355450Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4355579Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.4355622Z Autotune Choices Stats: 2025-12-04T09:58:54.4356421Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.4356640Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4356809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4357103Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4357760Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4358394Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4359018Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4359645Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4360270Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4360901Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4361533Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4362187Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4362827Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4363453Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4363583Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.4363658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4363703Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4363740Z unimplemented [] 2025-12-04T09:58:54.4363803Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4363902Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4364477Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4364514Z graph_break [] 2025-12-04T09:58:54.4364591Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4364630Z Autotune Choices Stats: 2025-12-04T09:58:54.4365372Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.4365523Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4365636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4365799Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4366487Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4367101Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4367705Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4368307Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4368908Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4369526Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4370127Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4370761Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4371371Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4371977Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4372107Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.4372149Z Autotune Choices Stats: 2025-12-04T09:58:54.4372914Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.4373132Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4373300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4373579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4374208Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4374862Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4375497Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4376154Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4376779Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4377411Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4378035Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4378664Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4379335Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4379980Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4380112Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.4380186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4380231Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4380268Z unimplemented [] 2025-12-04T09:58:54.4380331Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4380432Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4381014Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4381052Z graph_break [] 2025-12-04T09:58:54.4381128Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4381169Z Autotune Choices Stats: 2025-12-04T09:58:54.4381910Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.4382039Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4382152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4382315Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4382930Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4383555Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4384158Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4384763Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4385370Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4386006Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4386608Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4387209Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4387853Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4388469Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4388600Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.4388640Z Autotune Choices Stats: 2025-12-04T09:58:54.4389402Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.4389624Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4389789Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4390066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.4390696Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4391320Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4391970Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4392601Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4393234Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4393868Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4394487Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4395116Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4395742Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4396446Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4396590Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.4396664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4396710Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4396748Z unimplemented [] 2025-12-04T09:58:54.4396809Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4396910Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4397488Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4397527Z graph_break [] 2025-12-04T09:58:54.4397601Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4397643Z Autotune Choices Stats: 2025-12-04T09:58:54.4398373Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.4398502Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4398616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4398779Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4399397Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4400035Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4400654Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4401264Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4401866Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4403945Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4404551Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4405161Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4405780Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4406450Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4406583Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.4406623Z Autotune Choices Stats: 2025-12-04T09:58:54.4407387Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.4407606Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4407774Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4408118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4408749Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4409376Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4409995Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4410648Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4411277Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4411905Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4412532Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4413167Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4413792Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4414431Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4414580Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.4414656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4414700Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4414738Z unimplemented [] 2025-12-04T09:58:54.4414800Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4414903Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4415478Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4415516Z graph_break [] 2025-12-04T09:58:54.4415592Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4415632Z Autotune Choices Stats: 2025-12-04T09:58:54.4416418Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.4416565Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.4416678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4416841Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4417443Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4418042Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4418686Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4419285Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4419887Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4420489Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4421108Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4421708Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4422308Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4422941Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4423073Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.4423113Z Autotune Choices Stats: 2025-12-04T09:58:54.4423866Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.4424088Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4424251Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4424528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4425157Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4425794Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4426439Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4427080Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4427730Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4428358Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4428985Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4429634Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4430249Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4430877Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4431017Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.4431093Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.4431139Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4431177Z unimplemented [] 2025-12-04T09:58:54.4431240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4431359Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4431932Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4431974Z graph_break [] 2025-12-04T09:58:54.4432048Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4432091Z Autotune Choices Stats: 2025-12-04T09:58:54.4432826Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.4432956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4433070Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4433244Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4433859Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4434461Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4435067Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.4435695Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4436328Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4436928Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4437533Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4438152Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4438752Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4439360Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4439501Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.4439543Z Autotune Choices Stats: 2025-12-04T09:58:54.4440322Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.4440539Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4440703Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4440986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4441617Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4442253Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4442871Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4443499Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4444161Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4444784Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4445418Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4446080Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4446722Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4447347Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4447479Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.4447554Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4447598Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.4447635Z unimplemented [] 2025-12-04T09:58:54.4447697Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4447813Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4448412Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4448453Z graph_break [] 2025-12-04T09:58:54.4448526Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4448570Z Autotune Choices Stats: 2025-12-04T09:58:54.4449315Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.4449446Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4449562Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4449723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4450334Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4450947Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4451552Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4452153Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4452778Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4453385Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4453986Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4454595Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4455214Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4455816Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4455988Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.4456031Z Autotune Choices Stats: 2025-12-04T09:58:54.4456786Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.4457044Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4457214Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4457488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4458123Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4458748Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4459381Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4460004Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4460634Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4461298Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4461920Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4462549Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4463182Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4463819Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4463945Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.4464023Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4464065Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4464110Z unimplemented [] 
2025-12-04T09:58:54.4464171Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4464276Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4464853Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4464903Z graph_break [] 2025-12-04T09:58:54.4464978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4465021Z Autotune Choices Stats: 2025-12-04T09:58:54.4465778Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.4465907Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4466060Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4466222Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4466843Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4467451Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4468067Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4468671Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4469281Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4469924Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4470526Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4471137Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4471742Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4472356Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4472485Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.4472530Z Autotune Choices Stats: 2025-12-04T09:58:54.4473287Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4473508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4473686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4473983Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4474618Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4475250Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4475872Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4476553Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4477188Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4477816Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4478478Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4479107Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4479736Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4480359Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4480519Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.4480597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4480640Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4480681Z unimplemented [] 2025-12-04T09:58:54.4480743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.4480844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4481424Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4481465Z graph_break [] 2025-12-04T09:58:54.4481542Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4481582Z Autotune Choices Stats: 2025-12-04T09:58:54.4482315Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.4482478Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4482595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4482756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4483373Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4483979Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4484584Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4485203Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4485809Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4486438Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4487082Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4487685Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4488289Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4488893Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4489037Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.4489080Z Autotune Choices Stats: 
2025-12-04T09:58:54.4489842Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.4490061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4490227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4490503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4491165Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4491793Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4492418Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4493035Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4493688Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4494313Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4494936Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4495591Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4496259Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4496884Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4497016Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.4497090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4497135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4497192Z unimplemented [] 2025-12-04T09:58:54.4497254Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4497359Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4497937Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4497976Z graph_break [] 2025-12-04T09:58:54.4498056Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4498098Z Autotune Choices Stats: 2025-12-04T09:58:54.4498833Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.4498974Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4499088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4499256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.4499888Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4500487Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4501091Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4501696Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4502312Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4502913Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4503520Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4504149Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4504752Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4505357Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4505488Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.4505529Z Autotune Choices Stats: 2025-12-04T09:58:54.4506341Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4506569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4506738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4507023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4507658Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4508328Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4508946Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4509570Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4510198Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4510833Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4511465Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4512091Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4512746Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4513370Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4513502Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.4513576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4513619Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4513656Z unimplemented [] 2025-12-04T09:58:54.4513718Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4513817Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4514396Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4514443Z graph_break [] 2025-12-04T09:58:54.4514519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4514560Z Autotune Choices Stats: 2025-12-04T09:58:54.4515295Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.4515425Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4515538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4515702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4516338Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4516981Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4517584Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4518186Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4518785Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4519408Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4520012Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4520618Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4521249Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4521855Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4521989Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.4522029Z Autotune Choices Stats: 2025-12-04T09:58:54.4522784Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4523015Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4523180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4523462Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4524094Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4524720Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4525387Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4526049Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4526677Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4527305Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4527945Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4528578Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4529200Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4529868Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4530001Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.4530076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4530123Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4530162Z unimplemented [] 2025-12-04T09:58:54.4530225Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4530324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4530902Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4530943Z graph_break [] 2025-12-04T09:58:54.4531017Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4531060Z Autotune Choices Stats: 2025-12-04T09:58:54.4531807Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.4531949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4532061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4532225Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4532842Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4533457Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4534081Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4534684Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4535289Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4535896Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4536545Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4537150Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4537754Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4538395Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4538528Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.4538569Z Autotune Choices Stats: 2025-12-04T09:58:54.4539324Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.4539544Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4539712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4540010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4540685Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4541361Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4542090Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4542798Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4543425Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4544140Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4544830Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4545469Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4546133Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4546759Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4546942Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.4547020Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4547064Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4547104Z unimplemented [] 2025-12-04T09:58:54.4547167Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4547268Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4547847Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4547888Z graph_break [] 2025-12-04T09:58:54.4547963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4548008Z Autotune Choices Stats: 2025-12-04T09:58:54.4548742Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.4548905Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4549023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4549182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4549795Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4550412Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4551047Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4551650Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4552257Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4552864Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4553481Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4554086Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4554696Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4555310Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4555462Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.4555506Z Autotune Choices Stats: 2025-12-04T09:58:54.4556293Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4556519Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4556684Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4556962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4557598Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4558243Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4558871Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4559490Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4560170Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4560802Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4561423Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4562054Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.4562689Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4563323Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4563461Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.4563540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4563583Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4563624Z unimplemented [] 2025-12-04T09:58:54.4563685Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4563788Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4564381Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4564423Z graph_break [] 2025-12-04T09:58:54.4564499Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4564542Z Autotune Choices Stats: 2025-12-04T09:58:54.4565284Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.4565412Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4565529Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4565692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4566352Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4566957Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4567560Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4568207Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.4568821Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4569429Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4570030Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4570654Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4571258Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4571863Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4572004Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.4572049Z Autotune Choices Stats: 2025-12-04T09:58:54.4572843Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4573063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4573233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4573511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4574148Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4574785Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4575420Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4576085Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4576762Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4577390Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4578010Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4578638Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4579286Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4579911Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4580043Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.4580119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4580163Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4580203Z unimplemented [] 2025-12-04T09:58:54.4580264Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4580372Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4580957Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4581019Z graph_break [] 2025-12-04T09:58:54.4581097Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4581137Z Autotune Choices Stats: 2025-12-04T09:58:54.4581874Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.4582003Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4582120Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4582280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4582896Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4583514Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4584118Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4584723Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4585353Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4586009Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4586613Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4587218Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4587840Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4588442Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4588574Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.4588618Z Autotune Choices Stats: 2025-12-04T09:58:54.4589379Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.4589639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4589804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4590081Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4590714Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4591344Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4591981Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4592604Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4593242Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4593903Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4594524Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4595154Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4595781Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4596459Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4596591Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.4596665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4596711Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4596750Z unimplemented [] 2025-12-04T09:58:54.4596814Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4596917Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4597495Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4597552Z graph_break [] 2025-12-04T09:58:54.4597629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4597671Z Autotune Choices Stats: 2025-12-04T09:58:54.4598438Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.4598569Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4598683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4598846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4599458Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4600060Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4600682Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4601286Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4601889Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4602520Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4603129Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4603725Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4604325Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4604939Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4605073Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.4605114Z Autotune Choices Stats: 2025-12-04T09:58:54.4605870Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.4606124Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4606319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4606625Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4607261Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4607884Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4608506Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4609145Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4609771Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4610396Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4611189Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4611813Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4612444Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4613062Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4613211Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.4613287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4613333Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4613373Z unimplemented [] 2025-12-04T09:58:54.4613436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4613534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4614112Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4614152Z graph_break [] 2025-12-04T09:58:54.4614229Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4614270Z Autotune Choices Stats: 2025-12-04T09:58:54.4615003Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.4615166Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4615282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4615445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4616087Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4616694Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4617303Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4617918Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4618520Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4619124Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4619772Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4620373Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4620980Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4621589Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4621746Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.4621788Z Autotune Choices Stats: 2025-12-04T09:58:54.4622544Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.4622766Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4622930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4623212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4623875Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4624494Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4625121Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4625753Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4626429Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4627056Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4627684Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4628352Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4628978Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4629607Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4629738Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.4629813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4629856Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4629912Z unimplemented [] 2025-12-04T09:58:54.4629976Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4630075Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4630653Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4630694Z graph_break [] 
2025-12-04T09:58:54.4630767Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4630809Z Autotune Choices Stats: 2025-12-04T09:58:54.4631547Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.4631678Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4631803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4631966Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4632609Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4633214Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4633824Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4634429Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4635043Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4635650Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4636289Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4636940Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4637540Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4638149Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4638283Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.4638322Z Autotune Choices Stats: 2025-12-04T09:58:54.4639075Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.4639307Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4639474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4639759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4640399Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4641057Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.4641686Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4642312Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4642944Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4643589Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4644217Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4644841Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4645497Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4646178Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4646313Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.4646387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4646432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4646469Z unimplemented [] 2025-12-04T09:58:54.4646531Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4646630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4647205Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4647259Z graph_break [] 2025-12-04T09:58:54.4647333Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.4647377Z Autotune Choices Stats: 2025-12-04T09:58:54.4648121Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.4648253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4648368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4648533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4649138Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4649781Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4650386Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4650993Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4651609Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4652223Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4652829Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4653436Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4654082Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4654685Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4654817Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.4654860Z Autotune Choices Stats: 2025-12-04T09:58:54.4655613Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.4655845Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4656051Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4656329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4656967Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4657591Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4658258Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4658884Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4659513Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4660147Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4660782Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4661410Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4662044Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4662701Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4662833Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.4662908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4662950Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4662990Z unimplemented [] 2025-12-04T09:58:54.4663051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4663154Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4663731Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4663771Z graph_break [] 2025-12-04T09:58:54.4663846Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4663886Z Autotune Choices Stats: 
2025-12-04T09:58:54.4664627Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.4664767Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4664884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4665046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4665664Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4666338Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4666938Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4667544Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4668156Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4668762Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4669374Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4669983Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4670585Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4671232Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4671360Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.4671403Z Autotune Choices Stats: 2025-12-04T09:58:54.4672158Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.4672374Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4672540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4672830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4673463Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4674098Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4674719Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4675378Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4676048Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4676677Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4677300Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4677942Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4678575Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4679200Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4679370Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.4679447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4679489Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4679529Z unimplemented [] 2025-12-04T09:58:54.4679591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4679692Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4680272Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4680312Z graph_break [] 2025-12-04T09:58:54.4680389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4680429Z Autotune Choices Stats: 2025-12-04T09:58:54.4681168Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.4681315Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4681429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4681589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4682200Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4682803Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4683437Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4684046Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4684659Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.4685263Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4685885Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4686525Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4687124Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4687770Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4687899Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.4687941Z Autotune Choices Stats: 2025-12-04T09:58:54.4688705Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.4688926Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.4689094Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4689372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4690002Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4690644Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4691270Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4691906Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4692558Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4693189Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4693817Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4694455Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4695084Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4695711Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4695850Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.4695962Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4696006Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4696046Z unimplemented [] 2025-12-04T09:58:54.4696106Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4696243Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4696819Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4696857Z graph_break [] 2025-12-04T09:58:54.4696933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4696973Z Autotune Choices Stats: 2025-12-04T09:58:54.4697716Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.4697845Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4697960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4698143Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4698756Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4699369Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4699977Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4700622Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4701226Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4701829Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4702436Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4703051Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4703658Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4704274Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4704415Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.4704455Z Autotune Choices Stats: 2025-12-04T09:58:54.4705239Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.4705457Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4705622Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4705908Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4706584Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4707228Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4710749Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4711390Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4712092Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4712723Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4713353Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4713980Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4714619Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4715249Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4715380Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.4715459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4715507Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4715546Z unimplemented [] 2025-12-04T09:58:54.4715612Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4715725Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4716450Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4716489Z graph_break [] 2025-12-04T09:58:54.4716567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4716607Z Autotune Choices Stats: 2025-12-04T09:58:54.4717357Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.4717491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4717606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4717769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4718377Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4718993Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4719596Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4720197Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4720835Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4721438Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4722049Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4722654Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4723272Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4723877Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4724008Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.4724047Z Autotune Choices Stats: 2025-12-04T09:58:54.4724808Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.4725058Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4725227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4725511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4726187Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4726814Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4727469Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4728095Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4728726Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4729395Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4730023Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4730652Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4731274Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4731910Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4732039Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.4732114Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4732160Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4732197Z unimplemented [] 2025-12-04T09:58:54.4732259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4732361Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4732938Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4732988Z graph_break [] 2025-12-04T09:58:54.4733064Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4733104Z Autotune Choices Stats: 2025-12-04T09:58:54.4733859Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.4733987Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4734102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4734265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4734885Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4735492Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.4736148Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4736750Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4737352Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4738001Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4738603Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4739211Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4739810Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4740428Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4740558Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.4740597Z Autotune Choices Stats: 2025-12-04T09:58:54.4741362Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.4741593Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4741756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4742055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4742680Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4743312Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4743933Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4744568Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4745201Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4745828Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4746527Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4747151Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4747777Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4748403Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4748546Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.4748638Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.4748688Z Traceback (most recent call last): 2025-12-04T09:58:54.4748840Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.4748883Z self.assertTrue( 2025-12-04T09:58:54.4748988Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.4749039Z raise self.failureException(msg) 2025-12-04T09:58:54.4749165Z AssertionError: False is not true : Log file /tmp/tmpq1f39omc/flex_attention_configs.json was not created 2025-12-04T09:58:54.4749170Z 2025-12-04T09:58:54.4749245Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.4749411Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.4749414Z 2025-12-04T09:58:54.4749502Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.4749582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4749625Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4749685Z unimplemented [] 2025-12-04T09:58:54.4749746Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4750342Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), 
('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.4750443Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4750482Z graph_break [] 2025-12-04T09:58:54.4750556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4751049Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.4751100Z current_size = base.storage().size() 2025-12-04T09:58:54.4751142Z Autotune Choices Stats: 2025-12-04T09:58:54.4751881Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.4752011Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.4752124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4752302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4752917Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4753523Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4754121Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4754754Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4755361Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4756003Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4756602Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4757227Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4757829Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4758429Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4758572Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.4758613Z Autotune Choices Stats: 2025-12-04T09:58:54.4759400Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.4759620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4759784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4760061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4760690Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4761330Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4761949Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4762569Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4763228Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4763852Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4764470Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4765097Z triton_flex_attention_backward_45 
0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4765734Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4766399Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4766529Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.4766604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4766645Z frames 
[('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4766682Z unimplemented [] 2025-12-04T09:58:54.4766743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4766858Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4767464Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4767503Z graph_break [] 2025-12-04T09:58:54.4767578Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4767617Z Autotune Choices Stats: 2025-12-04T09:58:54.4768347Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.4768476Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4768593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4768753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4769366Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4769980Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4770579Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4771181Z triton_flex_attention_52 0.0120 ms 84.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4771813Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4772409Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4773011Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4773606Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4774215Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4774816Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4774945Z SingleProcess AUTOTUNE 
benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.4774987Z Autotune Choices Stats: 2025-12-04T09:58:54.4775741Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4776036Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4776202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4776479Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4777113Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4777736Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4778373Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4778992Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4779621Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4780277Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4780890Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4781524Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4782144Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4782781Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4782911Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.4782984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4783027Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4783063Z unimplemented [] 2025-12-04T09:58:54.4783124Z stats [('calls_captured', 
105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4783224Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4783803Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4783857Z graph_break [] 2025-12-04T09:58:54.4783932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4783972Z Autotune Choices Stats: 2025-12-04T09:58:54.4784737Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.4784865Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4784982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4785144Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4785757Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4786407Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4787026Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4787627Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4788228Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4788874Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4789481Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4790080Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4790683Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4791296Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4791426Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.4791467Z Autotune 
Choices Stats: 2025-12-04T09:58:54.4792220Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.4792446Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4792610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4792908Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4793541Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4794169Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4794790Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4795426Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4796076Z triton_flex_attention_backward_134 0.0202 ms 77.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4796701Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4797364Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4797991Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4798615Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4799237Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4799388Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.4799462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4799505Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4799541Z unimplemented [] 2025-12-04T09:58:54.4799602Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4799702Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4800271Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4800308Z graph_break [] 2025-12-04T09:58:54.4800383Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4800422Z Autotune Choices Stats: 2025-12-04T09:58:54.4801152Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.4801316Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4801430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4801591Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.4802199Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4802805Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4803412Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4804019Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4804621Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4805223Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4805852Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.4806484Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4807085Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4807692Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4807837Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.4807877Z Autotune Choices Stats: 2025-12-04T09:58:54.4808633Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.4808851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4809014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4809306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4809958Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4810575Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4811199Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4811827Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4812459Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4813083Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4813707Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4814365Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4814987Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4815612Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4815740Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.4815813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4815871Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4815908Z unimplemented [] 2025-12-04T09:58:54.4816014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4816113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4816688Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4816726Z graph_break [] 2025-12-04T09:58:54.4816799Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4816839Z Autotune Choices Stats: 2025-12-04T09:58:54.4817583Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.4817734Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4817847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4818011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4818646Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4819248Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4819850Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4820450Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4821067Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4821671Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4822272Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4822900Z triton_flex_attention_204 0.0140 ms 66.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4823498Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4824102Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4824231Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.4824270Z Autotune Choices Stats: 2025-12-04T09:58:54.4825024Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.4825252Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4825416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4825696Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4826363Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4827023Z 
triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4827646Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4828271Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4828899Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4829532Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4830161Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4830781Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4831443Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4832071Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4832200Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.4832273Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4832316Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4832352Z unimplemented [] 2025-12-04T09:58:54.4832413Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4832514Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4833087Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4833135Z graph_break [] 2025-12-04T09:58:54.4833208Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4833247Z Autotune Choices Stats: 2025-12-04T09:58:54.4833992Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.4834122Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4834235Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4834397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4835017Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4835645Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4836294Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4836899Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4837506Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4838125Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4838726Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4839328Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4839967Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4840567Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4840701Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.4840744Z Autotune Choices Stats: 2025-12-04T09:58:54.4841498Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.4841729Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4841896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4842176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4842813Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4843437Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4844088Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4844710Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4845342Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4846003Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4846638Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4847269Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4847899Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4848564Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4848696Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.4848773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4848819Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4848859Z unimplemented [] 2025-12-04T09:58:54.4848921Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4849023Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4849599Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4849640Z graph_break [] 2025-12-04T09:58:54.4849716Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4849759Z Autotune Choices Stats: 2025-12-04T09:58:54.4850506Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.4850637Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4850756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4850919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4851527Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4852166Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4852769Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4853374Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4853986Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4854594Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4855210Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4855817Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4856458Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4857092Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4857222Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.4857266Z Autotune Choices Stats: 2025-12-04T09:58:54.4858020Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.4858239Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4858404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4858694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4859333Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4859959Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4860582Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4861237Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4861864Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4862492Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4863117Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4863757Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.4864385Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4865024Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4865172Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.4865247Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4865290Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4865330Z unimplemented [] 2025-12-04T09:58:54.4865391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4865492Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4866100Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4866142Z graph_break [] 2025-12-04T09:58:54.4866220Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4866261Z Autotune Choices Stats: 2025-12-04T09:58:54.4866998Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.4867148Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4867265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4867426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4868038Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4868642Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4869283Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4869888Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.4870495Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4871096Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4871701Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4872303Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4872905Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4873538Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4873667Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.4873709Z Autotune Choices Stats: 2025-12-04T09:58:54.4874463Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.4874682Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4874845Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4875125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4875756Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4876432Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4877054Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4877698Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4878348Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4878973Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4879599Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4880247Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4880874Z triton_flex_attention_backward_358 0.0228 ms 68.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4881504Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4881646Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.4881723Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4881768Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4881808Z unimplemented [] 2025-12-04T09:58:54.4881870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4881991Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4882568Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4882606Z graph_break [] 2025-12-04T09:58:54.4882682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4882723Z Autotune Choices Stats: 2025-12-04T09:58:54.4883473Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.4883602Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4883719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4883891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4884500Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4885107Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4885711Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4886392Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4886997Z triton_flex_attention_390 0.0132 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4887608Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4888212Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4888829Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4889429Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4890034Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4890174Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.4890215Z Autotune Choices Stats: 2025-12-04T09:58:54.4890996Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.4891212Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4891383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4891663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4892297Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4892934Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4893553Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4894176Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4894844Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4895470Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4896131Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4896750Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4897390Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4898011Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4898143Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:54.4898216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4898259Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4898296Z unimplemented [] 2025-12-04T09:58:54.4898358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4898471Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4899076Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4899114Z graph_break [] 2025-12-04T09:58:54.4899191Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4899230Z Autotune Choices Stats: 2025-12-04T09:58:54.4899963Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.4900095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4900206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4900367Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4900981Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4901595Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4902198Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4902800Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4903433Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4904039Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4904646Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4905251Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4905861Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4906498Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4906627Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.4906668Z Autotune Choices Stats: 2025-12-04T09:58:54.4907440Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.4907696Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4907864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4908142Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4908773Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4909402Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4910037Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4910667Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4911302Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4911963Z 
triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4912582Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4913218Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4913838Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4914468Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4914599Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.4914674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4914719Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4914757Z unimplemented [] 2025-12-04T09:58:54.4914821Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4914922Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4915498Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4915546Z graph_break [] 2025-12-04T09:58:54.4915621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4915660Z Autotune Choices Stats: 2025-12-04T09:58:54.4916484Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.4916613Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4916727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4916890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4917497Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4918104Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4918724Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4919326Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4919925Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4920559Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4921164Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4921770Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4922373Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4922985Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4923117Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.4923157Z Autotune Choices Stats: 2025-12-04T09:58:54.4923917Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.4924141Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4924316Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4924617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4925252Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4925879Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4926540Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4927183Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4927807Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4928431Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4929096Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4929718Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4930341Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4930962Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4931107Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.4931182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4931226Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4931263Z unimplemented [] 2025-12-04T09:58:54.4931327Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4931426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4932006Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4932047Z graph_break [] 
2025-12-04T09:58:54.4932121Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4932166Z Autotune Choices Stats: 2025-12-04T09:58:54.4932898Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.4933059Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4933174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4933337Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4933948Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4934549Z triton_flex_attention_510 0.0097 ms 97.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4935159Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4935771Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4936415Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4937019Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4937664Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4938261Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4938863Z 
triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4939468Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4939613Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.4939654Z Autotune Choices Stats: 2025-12-04T09:58:54.4940416Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.4940636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4940800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4941085Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4941748Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4942370Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.4942994Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4943623Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4944262Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4944880Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4945509Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4946201Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4946822Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4947455Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4947587Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.4947661Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4947705Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4947755Z unimplemented [] 2025-12-04T09:58:54.4947819Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4947919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4948497Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.4948537Z graph_break [] 2025-12-04T09:58:54.4948611Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.4948654Z Autotune Choices Stats: 2025-12-04T09:58:54.4949399Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.4949545Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4949663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4949825Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4950468Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4951065Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4951671Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4952273Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4952884Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4953486Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4954093Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4954727Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4955328Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4955981Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4956111Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:54.4956154Z Autotune Choices Stats: 2025-12-04T09:58:54.4956921Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.4957157Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4957324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4957605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4958228Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4958892Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4959519Z triton_flex_attention_backward_584 0.0186 ms 
84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4960147Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4960778Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4961412Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4962039Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4962667Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4963321Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4963944Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4964079Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.4964154Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4964196Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4964236Z unimplemented [] 2025-12-04T09:58:54.4964297Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4964401Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4964972Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4965024Z graph_break [] 2025-12-04T09:58:54.4965098Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4965143Z Autotune Choices Stats: 
2025-12-04T09:58:54.4965887Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.4966064Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4966179Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4966340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4966951Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4967597Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4968199Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4968799Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4969403Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4970020Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4970622Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4971226Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4971863Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4972466Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4972597Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.4972640Z Autotune Choices Stats: 2025-12-04T09:58:54.4973396Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.4973628Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4973797Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4974076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4974707Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4975330Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4976033Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4976655Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4977286Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4977919Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4978546Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4979176Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4979805Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4980464Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4980592Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.4980669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4980712Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4980751Z unimplemented [] 2025-12-04T09:58:54.4980812Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4980914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4981490Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4981529Z graph_break [] 2025-12-04T09:58:54.4981604Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4981645Z Autotune Choices Stats: 2025-12-04T09:58:54.4982388Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.4982524Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4982642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4982801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4983414Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4984030Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4984653Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4985249Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4985854Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.4986501Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4987115Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4987719Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4988320Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4988956Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4989084Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.4989127Z Autotune Choices Stats: 2025-12-04T09:58:54.4989889Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.4990108Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.4990276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4990564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4991195Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4991824Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4992447Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4993098Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4993725Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4994352Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4994973Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4995615Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.4996287Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4996907Z triton_flex_attention_backward_671 0.0231 ms 67.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.4997081Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.4997161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.4997203Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.4997243Z unimplemented [] 2025-12-04T09:58:54.4997303Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.4997406Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.4997986Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.4998025Z graph_break [] 2025-12-04T09:58:54.4998100Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.4998141Z Autotune Choices Stats: 2025-12-04T09:58:54.4998881Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.4999025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.4999141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.4999306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.4999920Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5000522Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5001160Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5001761Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5002363Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5002965Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5003583Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5004186Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5004786Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5005397Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5005549Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.5005592Z Autotune Choices Stats: 2025-12-04T09:58:54.5006378Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.5006596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5006762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5007040Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5007688Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5008328Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5008953Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5009577Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5010241Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5010871Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5011498Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5012129Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5012764Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5013389Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5013529Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.5013604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5013648Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5013687Z unimplemented [] 2025-12-04T09:58:54.5013751Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5013849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5014447Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5014487Z graph_break [] 2025-12-04T09:58:54.5014562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5014603Z Autotune Choices Stats: 2025-12-04T09:58:54.5015346Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.5015480Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5015594Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5015757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5016455Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5017056Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5017665Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5018300Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5018897Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5019509Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5020115Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5020724Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5021326Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5021933Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5022072Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.5022111Z Autotune Choices Stats: 2025-12-04T09:58:54.5022899Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.5023115Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5023280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5023565Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5024198Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5024827Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5025465Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5026136Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5026796Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5027421Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5028054Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5028677Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5029317Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5029935Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5030068Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.5030142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5030186Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5030222Z unimplemented [] 2025-12-04T09:58:54.5030284Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5030385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5030971Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5031029Z graph_break [] 2025-12-04T09:58:54.5031105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5031145Z Autotune Choices Stats: 2025-12-04T09:58:54.5031891Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.5032022Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5032137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5032297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5032907Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5033523Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5034138Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5034743Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5035372Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5036009Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5036614Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5037216Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5037832Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5038437Z triton_flex_attention_800 0.0162 ms 70.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5038569Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.5038610Z Autotune Choices Stats: 2025-12-04T09:58:54.5039360Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.5039620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5039791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5040069Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5040699Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5041325Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5041967Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5042590Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5043218Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5043882Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5044505Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5045136Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5045761Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5046440Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5046571Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.5046645Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5046690Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5046729Z unimplemented [] 2025-12-04T09:58:54.5046789Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5046888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5047466Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5047523Z graph_break [] 2025-12-04T09:58:54.5047596Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5047639Z Autotune Choices Stats: 2025-12-04T09:58:54.5048406Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.5048537Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5048652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5048814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5049434Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5050036Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5050647Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5051248Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5051855Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5052486Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5053087Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5053692Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5054292Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5054900Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5055032Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.5055072Z Autotune Choices Stats: 2025-12-04T09:58:54.5055833Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.5056083Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5056270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5056576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5057207Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5057835Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5058460Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5059091Z triton_flex_attention_backward_860 0.0190 ms 82.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5059725Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5060349Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5061000Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5061629Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5062252Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5062868Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5063010Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.5063087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5063129Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5063169Z unimplemented [] 2025-12-04T09:58:54.5063230Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5063331Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5063908Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5063948Z graph_break [] 2025-12-04T09:58:54.5064022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5064066Z Autotune Choices Stats: 2025-12-04T09:58:54.5064805Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.5064970Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5065085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5065248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5065869Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5066504Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5067109Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5067733Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5068336Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5068937Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5069572Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5070180Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5070784Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5071383Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5071534Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.5071576Z Autotune Choices Stats: 2025-12-04T09:58:54.5072337Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.5072557Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5072721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5073000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.5073661Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5074286Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5074912Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5075535Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5076212Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5076839Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5077463Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5078129Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5078755Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5079378Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5079505Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.5079582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5079625Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5079666Z unimplemented [] 2025-12-04T09:58:54.5079740Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5079842Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5080417Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5080458Z graph_break [] 2025-12-04T09:58:54.5080532Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5080575Z Autotune Choices Stats: 2025-12-04T09:58:54.5081315Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 
2025-12-04T09:58:54.5081444Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5081572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5081734Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5082367Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5082970Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5083576Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5084180Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5084788Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5085390Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5086032Z triton_flex_attention_934 0.0136 ms 76.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5086680Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5087279Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5087886Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5088016Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.5088058Z Autotune Choices Stats: 2025-12-04T09:58:54.5088809Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.5089044Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5089209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5089487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5090121Z triton_flex_attention_backward_961 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5090769Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5091388Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5092016Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5092648Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5093284Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5093919Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5094538Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5095197Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5095819Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5095975Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 
choices 2025-12-04T09:58:54.5096057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5096100Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5096142Z unimplemented [] 2025-12-04T09:58:54.5096203Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5096304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5096880Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5096941Z graph_break [] 2025-12-04T09:58:54.5097016Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5097057Z Autotune Choices Stats: 2025-12-04T09:58:54.5098080Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.5098209Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.5098323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5098483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5099089Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5099755Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5100358Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5100959Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5101560Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5102173Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5102780Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5103381Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5104027Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5104632Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5104763Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.5104806Z Autotune Choices Stats: 2025-12-04T09:58:54.5105568Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.5105798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5106004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5106285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5106924Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5107558Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5108251Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5108871Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5109505Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5110135Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5110788Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5111419Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5112049Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5112712Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5112845Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.5112921Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.5112966Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5113004Z unimplemented [] 2025-12-04T09:58:54.5113069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5113170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5113747Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5113785Z graph_break [] 2025-12-04T09:58:54.5113861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5113900Z Autotune Choices Stats: 2025-12-04T09:58:54.5114634Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.5114776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5114893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5115063Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5115676Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5116332Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5116970Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.5117572Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5118175Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5118786Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5119407Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5120017Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5120624Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5121266Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5121396Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.5121437Z Autotune Choices Stats: 2025-12-04T09:58:54.5122195Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.5122410Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5122580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5122878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5123508Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5124137Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5124767Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5125426Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5126090Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5126726Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5127356Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5128000Z triton_flex_attention_backward_1057 0.0221 ms 71.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5128625Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5129256Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5129432Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.5129507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5129552Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:54.5129589Z unimplemented [] 2025-12-04T09:58:54.5129653Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5129753Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5130325Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5130366Z graph_break [] 2025-12-04T09:58:54.5130443Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5130484Z Autotune Choices Stats: 2025-12-04T09:58:54.5131232Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.5131378Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5131492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5131652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5132256Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5132862Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5133498Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5134103Z triton_flex_attention_1063 0.0113 ms 90.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5134707Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5135317Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5136001Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5136605Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5137209Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5137824Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5137982Z SingleProcess AUTOTUNE 
benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.5138023Z Autotune Choices Stats: 2025-12-04T09:58:54.5138774Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.5138995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5139164Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5139439Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5140072Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5140711Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5141344Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5141961Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5142620Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5143254Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5143882Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5144510Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5145149Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5145775Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5145955Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.5146031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5146077Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5146116Z unimplemented [] 
2025-12-04T09:58:54.5146179Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5146279Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5146890Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5146931Z graph_break [] 2025-12-04T09:58:54.5147006Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5147048Z Autotune Choices Stats: 2025-12-04T09:58:54.5147784Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.5147913Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5148028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5148190Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5148811Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5149412Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5150022Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5150658Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5151270Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5151873Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5152470Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5153089Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5153691Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5154293Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5154436Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.5154478Z Autotune Choices Stats: 2025-12-04T09:58:54.5155258Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.5155480Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5155651Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5155961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5156592Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5157224Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5157866Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5158493Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5159163Z 
triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5159795Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5160421Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5161048Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5161697Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5162322Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5162454Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.5162529Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5162571Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5162612Z unimplemented [] 2025-12-04T09:58:54.5162673Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.5162775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5163386Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5163427Z graph_break [] 2025-12-04T09:58:54.5163500Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5163545Z Autotune Choices Stats: 2025-12-04T09:58:54.5164276Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.5164409Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5164524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5164684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5165301Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5165912Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5166547Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5167156Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5167795Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5168391Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5168996Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5169602Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5170218Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5170821Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5170953Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.5170996Z Autotune Choices Stats: 
2025-12-04T09:58:54.5171752Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.5172004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5172172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5172448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5173084Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5173717Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5174352Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5174971Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5175613Z triton_flex_attention_backward_1193 0.0202 ms 76.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5176304Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5176923Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5177559Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5178188Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5178829Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5178956Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.5179032Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5179077Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5179116Z unimplemented [] 2025-12-04T09:58:54.5179176Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5179284Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5179854Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5179909Z graph_break [] 2025-12-04T09:58:54.5179984Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5180026Z Autotune Choices Stats: 2025-12-04T09:58:54.5180780Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.5180909Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5181025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5181189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.5181800Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5182404Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5183020Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5183635Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5184238Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5184870Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5185479Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5186115Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5186717Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5187331Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5187461Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.5187505Z Autotune Choices Stats: 2025-12-04T09:58:54.5188260Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.5188493Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5188661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5188967Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5189603Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5190232Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5190855Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5191500Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5192130Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5192754Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5193415Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5194039Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5194670Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5195299Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5195438Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.5195513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5195554Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5195592Z unimplemented [] 2025-12-04T09:58:54.5195651Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5195752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:54.5196358Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5196395Z graph_break [] 2025-12-04T09:58:54.5196471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5196510Z Autotune Choices Stats: 2025-12-04T09:58:54.5197254Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.5197423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5197538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5197696Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5198308Z triton_flex_attention_1248 0.0088 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5198908Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5199521Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5200134Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5200736Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5201344Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5201977Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5202580Z 
triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5203190Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5203793Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5203931Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.5203972Z Autotune Choices Stats: 2025-12-04T09:58:54.5204729Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5204948Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5205113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5205398Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5206091Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.5206722Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5207346Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5207969Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5208609Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5209237Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5209860Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5210525Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5211151Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5211777Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5211907Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.5211982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5212034Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5212071Z unimplemented [] 2025-12-04T09:58:54.5212132Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5212232Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5212808Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5212845Z graph_break [] 2025-12-04T09:58:54.5212922Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5212962Z Autotune Choices Stats: 2025-12-04T09:58:54.5213699Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.5213838Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5213949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5214115Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5214746Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5217364Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5217975Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5218585Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5219210Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5219820Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5220429Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5221067Z triton_flex_attention_1308 0.0142 ms 73.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5221669Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5222278Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5222410Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.5222451Z Autotune Choices Stats: 2025-12-04T09:58:54.5223216Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5223435Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5223602Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5223885Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5224516Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5225176Z 
triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5225797Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5226465Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5227092Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5227740Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5228377Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5229000Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5229656Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5230284Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5230415Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.5230489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5230533Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5230572Z unimplemented [] 2025-12-04T09:58:54.5230634Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5230736Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5231313Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5231364Z graph_break [] 2025-12-04T09:58:54.5231438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5231479Z Autotune Choices Stats: 2025-12-04T09:58:54.5232225Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.5232356Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5232470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5232632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5233265Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5233871Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5234476Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5235079Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5235684Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5236335Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5236939Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5237539Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5238184Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5238788Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5238917Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.5238957Z Autotune Choices Stats: 2025-12-04T09:58:54.5239722Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.5239962Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5240126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5240402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5241027Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5241653Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5242314Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5242939Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5243570Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5244208Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5244841Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5245471Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5246130Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5246803Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5246931Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.5247024Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.5247076Z Traceback (most recent call last): 2025-12-04T09:58:54.5247228Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.5247272Z self.assertTrue( 2025-12-04T09:58:54.5247375Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.5247426Z raise self.failureException(msg) 2025-12-04T09:58:54.5247552Z AssertionError: False is not 
true : Log file /tmp/tmp1l51gxyl/flex_attention_configs.json was not created 2025-12-04T09:58:54.5247555Z 2025-12-04T09:58:54.5247635Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.5247801Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.5247805Z 2025-12-04T09:58:54.5247895Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.5247982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5248025Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5248062Z unimplemented [] 2025-12-04T09:58:54.5248126Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5248704Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.5248804Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5248841Z graph_break [] 2025-12-04T09:58:54.5248916Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5249410Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.5249458Z current_size = base.storage().size() 2025-12-04T09:58:54.5249498Z Autotune Choices Stats: 2025-12-04T09:58:54.5250265Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.5250405Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5250519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5250676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5251290Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5251896Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5252498Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5253107Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5253707Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5254314Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5254936Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5255533Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5256167Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5256762Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5256906Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.5256948Z Autotune Choices Stats: 2025-12-04T09:58:54.5257699Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.5257922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5258089Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5258377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5259030Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5259651Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5260274Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5260897Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5261532Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5262158Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5262778Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5263430Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5264047Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5264671Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5264798Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.5264873Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5264924Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5264962Z unimplemented [] 2025-12-04T09:58:54.5265023Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5265124Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5265698Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5265736Z graph_break [] 2025-12-04T09:58:54.5265809Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.5265849Z Autotune Choices Stats: 2025-12-04T09:58:54.5266624Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.5266765Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5266879Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5267039Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5267682Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5268286Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5268883Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5269482Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5270100Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5270701Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5271306Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5271937Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5272529Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5273132Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5273262Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.5273304Z Autotune Choices Stats: 2025-12-04T09:58:54.5274072Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.5274287Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5274452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5274732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5275368Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5276061Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5276683Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5277314Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5277944Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5278579Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5279202Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5279827Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5280475Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5281099Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5281225Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.5281301Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5281344Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5281382Z unimplemented [] 2025-12-04T09:58:54.5281443Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5281544Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5282119Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5282166Z graph_break [] 2025-12-04T09:58:54.5282239Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5282280Z Autotune Choices Stats: 2025-12-04T09:58:54.5283009Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.5283140Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5283256Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5283417Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5284053Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5284653Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5285258Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5285857Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5286500Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.5287115Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5287725Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5288324Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5288954Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5289565Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5289694Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.5289734Z Autotune Choices Stats: 2025-12-04T09:58:54.5290493Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.5290728Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.5290892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5291176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5291805Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5292427Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5293079Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5293705Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5294331Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5294957Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5295599Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5296260Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5296881Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5297550Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5297678Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.5297751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5297795Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5297831Z unimplemented [] 2025-12-04T09:58:54.5297892Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5297992Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5298567Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5298604Z graph_break [] 2025-12-04T09:58:54.5298677Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5298733Z Autotune Choices Stats: 2025-12-04T09:58:54.5299477Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.5299606Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5299719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5299881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5300493Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5301122Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5301726Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5302327Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5302928Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5303547Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5304147Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5304749Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5305355Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5306043Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5306173Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.5306213Z Autotune Choices Stats: 2025-12-04T09:58:54.5306968Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.5307184Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5307351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5307641Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5308269Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5308894Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5309519Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5310167Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5310793Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5311419Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5312047Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5312682Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5313306Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5313963Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5314093Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.5314166Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5314208Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5314244Z unimplemented [] 2025-12-04T09:58:54.5314305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5314405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5314979Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5315016Z graph_break [] 2025-12-04T09:58:54.5315092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5315131Z Autotune Choices Stats: 2025-12-04T09:58:54.5315864Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.5316033Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5316146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5316307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5316916Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5317534Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.5318172Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5318771Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5319370Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5319978Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5320596Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5321197Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5321801Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5322440Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5322568Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.5322608Z Autotune Choices Stats: 2025-12-04T09:58:54.5323364Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.5323584Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5323750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5324029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5324663Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5325297Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5325976Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5326634Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5327262Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5327894Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5328522Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5329164Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5329790Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5330414Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5330552Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.5330625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5330668Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5330704Z unimplemented [] 2025-12-04T09:58:54.5330765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5330886Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5331458Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5331496Z graph_break [] 2025-12-04T09:58:54.5331570Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5331609Z Autotune Choices Stats: 2025-12-04T09:58:54.5332351Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.5332479Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5332592Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5332764Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5333371Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5333978Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5334585Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5335218Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5335817Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5336443Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5337042Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5337657Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5338266Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5338864Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5339013Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.5339053Z Autotune Choices Stats: 2025-12-04T09:58:54.5339834Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.5340053Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5340216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5340495Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5341125Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5341760Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5342386Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5343012Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5343676Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5344300Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5344926Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5345559Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5346230Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5346854Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5346987Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.5347063Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5347104Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5347143Z unimplemented [] 2025-12-04T09:58:54.5347202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5347322Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5347918Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5347958Z graph_break [] 2025-12-04T09:58:54.5348031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5348075Z Autotune Choices Stats: 2025-12-04T09:58:54.5348812Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.5348943Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5349058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5349219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5349830Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5350446Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5351047Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5351653Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5352277Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5352877Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5353486Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5354087Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5354708Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5355309Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5355440Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.5355482Z Autotune Choices Stats: 2025-12-04T09:58:54.5356264Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.5356523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5356690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5356970Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5357610Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5358236Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5358873Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5359501Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5360128Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5360780Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5361400Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5362032Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5362657Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5363293Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5363421Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.5363496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5363540Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5363578Z unimplemented [] 2025-12-04T09:58:54.5363639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5363742Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5364314Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5364363Z graph_break [] 2025-12-04T09:58:54.5364438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5364481Z Autotune Choices Stats: 2025-12-04T09:58:54.5365246Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.5365372Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5365488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5365650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5366296Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5366899Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5367519Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5368123Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5368723Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5369371Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5369965Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5370571Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5371172Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5371786Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5371914Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.5371957Z Autotune Choices Stats: 2025-12-04T09:58:54.5372714Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5372944Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5373114Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5373411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.5374041Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5374671Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5375293Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5375962Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5376589Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5377219Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5377887Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5378512Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5379147Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5379778Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5379925Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.5380001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5380042Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5380078Z unimplemented [] 2025-12-04T09:58:54.5380139Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5380240Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5380808Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5380849Z graph_break [] 2025-12-04T09:58:54.5380923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5380963Z Autotune Choices Stats: 2025-12-04T09:58:54.5381716Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.5381874Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5381989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5382149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5382763Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5383374Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5383979Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5384589Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5385206Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5385815Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5386496Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5387093Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5387701Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5388298Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5391667Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.5391708Z Autotune Choices Stats: 2025-12-04T09:58:54.5392464Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.5392684Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5392852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5393145Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5393792Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5394412Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5395046Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5395667Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5396350Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5396981Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5397611Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5398274Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5398906Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5399531Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5399660Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.5399738Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5399799Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5399836Z unimplemented [] 2025-12-04T09:58:54.5399896Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5399998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5400576Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5400613Z graph_break [] 2025-12-04T09:58:54.5400689Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5400730Z Autotune Choices Stats: 2025-12-04T09:58:54.5401471Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.5401608Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.5401721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5401883Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5402513Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5403111Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5403708Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5404311Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5404916Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5405520Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5406155Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5406809Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5407410Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5408017Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5408147Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.5408188Z Autotune Choices Stats: 2025-12-04T09:58:54.5408950Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.5409181Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5409348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5409629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5410265Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5410914Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5411535Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5412160Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5412788Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5413421Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5414051Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5414674Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5415325Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5415989Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5416123Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.5416198Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.5416242Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5416279Z unimplemented [] 2025-12-04T09:58:54.5416341Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5416441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5417025Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5417074Z graph_break [] 2025-12-04T09:58:54.5417149Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5417190Z Autotune Choices Stats: 2025-12-04T09:58:54.5417933Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.5418064Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5418176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5418338Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5418955Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5419584Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5420182Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.5420783Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5421383Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5422004Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5422609Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5423215Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5423852Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5424458Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5424589Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.5424630Z Autotune Choices Stats: 2025-12-04T09:58:54.5425384Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.5425613Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5425785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5426098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5426729Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5427353Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5428017Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5428641Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5429270Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5429900Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5430537Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5431166Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5431794Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5432456Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5432587Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.5432661Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5432706Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.5432745Z unimplemented [] 2025-12-04T09:58:54.5432805Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5432906Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5433482Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5433520Z graph_break [] 2025-12-04T09:58:54.5433595Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5433637Z Autotune Choices Stats: 2025-12-04T09:58:54.5434391Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.5434519Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5434631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5434794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5435411Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5436087Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5436688Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5437292Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5437897Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5438504Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5439116Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5439720Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5440319Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5440947Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5441077Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.5441119Z Autotune Choices Stats: 2025-12-04T09:58:54.5441879Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.5442097Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5442263Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5442557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5443190Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5443820Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5444444Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5445106Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5445728Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5446387Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5447011Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5447657Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5448290Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5448911Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5449087Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.5449165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5449211Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5449249Z unimplemented [] 
2025-12-04T09:58:54.5449311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5449410Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5449992Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5450030Z graph_break [] 2025-12-04T09:58:54.5450105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5450146Z Autotune Choices Stats: 2025-12-04T09:58:54.5450875Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.5451016Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5451128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5451290Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5451901Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5452500Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5453133Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5453730Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5454337Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5454938Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5455569Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5456204Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5456805Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5457415Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5457571Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.5457613Z Autotune Choices Stats: 2025-12-04T09:58:54.5458368Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.5458588Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5458751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5459029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5459669Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5460297Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5460922Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5461550Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5462211Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5462836Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5463464Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5464093Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5464730Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5465352Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5465494Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.5465573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5465616Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5465657Z unimplemented [] 2025-12-04T09:58:54.5465717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.5465844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5466449Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5466489Z graph_break [] 2025-12-04T09:58:54.5466562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5466604Z Autotune Choices Stats: 2025-12-04T09:58:54.5468537Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.5468672Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5468788Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5468972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5469597Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5470202Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5470812Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5471435Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5472042Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5472649Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5473298Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5473912Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5474513Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5475121Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5475263Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.5475304Z Autotune Choices Stats: 
2025-12-04T09:58:54.5476125Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.5476343Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5476508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5476791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5477446Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5478069Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5478707Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5479336Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5479985Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5480611Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5481238Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5481878Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5482509Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5483134Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5483267Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.5483345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5483387Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5483427Z unimplemented [] 2025-12-04T09:58:54.5483489Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5483593Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5484184Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5484226Z graph_break [] 2025-12-04T09:58:54.5484300Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5484343Z Autotune Choices Stats: 2025-12-04T09:58:54.5485079Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.5485211Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5485327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5485487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.5486155Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5486773Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5487379Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5487983Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5488611Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5489211Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5489815Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5490430Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5491048Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5491653Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5491783Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.5491830Z Autotune Choices Stats: 2025-12-04T09:58:54.5492582Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.5492823Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5492988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5493271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5493898Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5494535Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5495162Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5495783Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5496435Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5497095Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5497715Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5498345Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5498989Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5499625Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5499755Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.5499832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5499875Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5499916Z unimplemented [] 2025-12-04T09:58:54.5499976Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5500076Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5500656Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5500713Z graph_break [] 2025-12-04T09:58:54.5500786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5500827Z Autotune Choices Stats: 2025-12-04T09:58:54.5501586Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.5501716Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5501833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5501992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5502623Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5503233Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5503844Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5504453Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5505055Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5505682Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5506338Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5506943Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5507557Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5508175Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5508304Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.5508345Z Autotune Choices Stats: 2025-12-04T09:58:54.5509103Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.5509321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5509500Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5509789Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5510426Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5511053Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5511690Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5512321Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5512959Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5513589Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5514234Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5514870Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5515497Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5516169Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5516310Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.5516390Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5516432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5516475Z unimplemented [] 2025-12-04T09:58:54.5516537Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5516639Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5517218Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5517258Z graph_break [] 2025-12-04T09:58:54.5517335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5517374Z Autotune Choices Stats: 2025-12-04T09:58:54.5518111Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.5518270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5518388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5518547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5519159Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5519775Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5520379Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5520992Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5521600Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5522204Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5522831Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5523429Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5524044Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5524647Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5524787Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.5524829Z Autotune Choices Stats: 2025-12-04T09:58:54.5525584Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.5525804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5526001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5526279Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5526945Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5527570Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5528189Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5528830Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5529471Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5530100Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5530728Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5531364Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5531984Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5532611Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5532738Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.5532813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5532854Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5532908Z unimplemented [] 2025-12-04T09:58:54.5532968Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5533066Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5533642Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5533679Z graph_break [] 2025-12-04T09:58:54.5533754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5533794Z Autotune Choices Stats: 2025-12-04T09:58:54.5534534Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.5534662Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5534789Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5534948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5535569Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5536189Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5536811Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5537411Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5538027Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5538632Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5539237Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5539863Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5540463Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5541088Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5541219Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.5541260Z Autotune Choices Stats: 2025-12-04T09:58:54.5542012Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.5542240Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5542406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5542687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5543319Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5543962Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5544588Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5545228Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5545857Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5546540Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5547167Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5547791Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.5548442Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5549072Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5549206Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.5549279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5549323Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5549361Z unimplemented [] 2025-12-04T09:58:54.5549424Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5549542Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5550119Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5550171Z graph_break [] 2025-12-04T09:58:54.5550246Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5550287Z Autotune Choices Stats: 2025-12-04T09:58:54.5551032Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.5551165Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5551278Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5551440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5552053Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5552678Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5553283Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5553892Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.5554494Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5555112Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5555720Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5556340Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5556963Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5557563Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5557692Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.5557733Z Autotune Choices Stats: 2025-12-04T09:58:54.5558499Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.5558731Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5558897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5559180Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5559811Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5560440Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5561084Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5561707Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5562343Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5562975Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5563608Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5564233Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5564861Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5565519Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5565649Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.5565723Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5565768Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5565807Z unimplemented [] 2025-12-04T09:58:54.5565868Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5565986Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5566581Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5566618Z graph_break [] 2025-12-04T09:58:54.5566694Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5566735Z Autotune Choices Stats: 2025-12-04T09:58:54.5567488Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.5567630Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5567743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5567906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5568522Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5569132Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5569753Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5570356Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5570971Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5571575Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5572191Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5572801Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5573398Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5574023Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5574152Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.5574192Z Autotune Choices Stats: 2025-12-04T09:58:54.5574946Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.5575175Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5575339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5575625Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5576292Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5576927Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5577550Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5578202Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5578831Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5579470Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5580093Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5580731Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5581361Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5581982Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5582138Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.5582213Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5582257Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5582295Z unimplemented [] 2025-12-04T09:58:54.5582360Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5582460Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5583045Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5583087Z graph_break [] 2025-12-04T09:58:54.5583162Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5583203Z Autotune Choices Stats: 2025-12-04T09:58:54.5583952Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.5584094Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5584207Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5584369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5584990Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5585590Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5586262Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5586863Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5587467Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5588077Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5588679Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5589291Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5589893Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5590492Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5590646Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.5590686Z Autotune Choices Stats: 2025-12-04T09:58:54.5591445Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.5591666Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5591830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5592121Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5592753Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5593389Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5594009Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5594631Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5595283Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5595907Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5596591Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5597210Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5597856Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5598478Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5598623Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.5598699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5598740Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5598782Z unimplemented [] 2025-12-04T09:58:54.5598843Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5598944Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5599530Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5599572Z graph_break [] 2025-12-04T09:58:54.5599647Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5599690Z Autotune Choices Stats: 2025-12-04T09:58:54.5600435Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.5600574Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5600694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5600855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5601478Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5602075Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5602677Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5603300Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5603902Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5604505Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5605125Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5605733Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5606361Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5606967Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5607115Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.5607155Z Autotune Choices Stats: 2025-12-04T09:58:54.5607918Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.5608136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5608300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5608574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5609232Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5609860Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5610492Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5611119Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5611768Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5612395Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5613013Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5613671Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5614309Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5614935Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5615065Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.5615143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5615185Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5615223Z unimplemented [] 2025-12-04T09:58:54.5615286Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5615390Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5616003Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5616064Z graph_break [] 
2025-12-04T09:58:54.5616140Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5616183Z Autotune Choices Stats: 2025-12-04T09:58:54.5617095Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.5617229Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5617346Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5617504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5618130Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5618751Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5619358Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5619960Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5620586Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5621190Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5621797Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5622406Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5623017Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5623624Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5623754Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.5623796Z Autotune Choices Stats: 2025-12-04T09:58:54.5624547Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.5624786Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5624955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5625234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5625875Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5626562Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.5627194Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5627828Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5628457Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5629110Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5629736Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5630368Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5631003Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5631644Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5631773Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.5631851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5631895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5631936Z unimplemented [] 2025-12-04T09:58:54.5631998Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5632103Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5632678Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5632726Z graph_break [] 2025-12-04T09:58:54.5632800Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.5632841Z Autotune Choices Stats: 2025-12-04T09:58:54.5633594Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.5633723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5633843Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5634002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5634626Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5635232Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5635857Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5636487Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5637096Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5637726Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5638325Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5638927Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5639544Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5640158Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5640287Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.5640333Z Autotune Choices Stats: 2025-12-04T09:58:54.5641086Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.5641301Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5641483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5641772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5642405Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5643032Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5643671Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5644303Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5644936Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5645563Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5646283Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5646912Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5647536Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5648176Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5648322Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.5648396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5648440Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5648479Z unimplemented [] 2025-12-04T09:58:54.5648540Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5648640Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5649219Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5649257Z graph_break [] 2025-12-04T09:58:54.5649332Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5649372Z Autotune Choices Stats: 
2025-12-04T09:58:54.5650125Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.5650280Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5650392Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5650552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5651162Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5651869Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5652472Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5653087Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5653697Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5654299Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5654917Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5655526Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5656181Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5656787Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5656934Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.5656974Z Autotune Choices Stats: 2025-12-04T09:58:54.5657728Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.5657946Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5658110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5658392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5659055Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5659680Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5660305Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5660937Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5661583Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5662214Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5662839Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5663493Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5664121Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5664757Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5664888Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.5664963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5665008Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5665057Z unimplemented [] 2025-12-04T09:58:54.5665120Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5665220Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5665798Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5665838Z graph_break [] 2025-12-04T09:58:54.5665914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5666011Z Autotune Choices Stats: 2025-12-04T09:58:54.5666753Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.5666899Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5667017Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5667182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5667802Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5668408Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5669028Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5669638Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5670249Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.5670856Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5671467Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5672091Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5672686Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5673306Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5673440Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.5673481Z Autotune Choices Stats: 2025-12-04T09:58:54.5674245Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.5674472Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.5674637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5674916Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5675555Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5676249Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5676873Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5677519Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5678148Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5678799Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5679425Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5680056Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5680701Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5681327Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5681461Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.5681537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5681582Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5681619Z unimplemented [] 2025-12-04T09:58:54.5681682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5681793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5682371Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5682421Z graph_break [] 2025-12-04T09:58:54.5682495Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5682537Z Autotune Choices Stats: 2025-12-04T09:58:54.5683270Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.5683400Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5683513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5683675Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5684293Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5684910Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5685528Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5686172Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5686775Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5687393Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5688002Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5688604Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5689229Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5689827Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5689957Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.5689997Z Autotune Choices Stats: 2025-12-04T09:58:54.5690762Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.5690991Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5691155Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5691438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5692070Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5692693Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5693348Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5693978Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5694615Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5695246Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5695885Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5696555Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5697182Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5697839Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5697971Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.5698047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5698088Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5698126Z unimplemented [] 2025-12-04T09:58:54.5698189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5698291Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5698886Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5698926Z graph_break [] 2025-12-04T09:58:54.5699000Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5699041Z Autotune Choices Stats: 2025-12-04T09:58:54.5699795Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.5699924Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5700041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5700203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5700812Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5701443Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5702048Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5702652Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5703265Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5703882Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5704487Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5705101Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5705719Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5706392Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5706522Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.5706566Z Autotune Choices Stats: 2025-12-04T09:58:54.5707341Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5707562Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5707725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5708016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5708648Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5709277Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5709901Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5710558Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5711186Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5711826Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5712452Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5713088Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5713725Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5714370Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5714498Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.5714576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5714617Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5714655Z unimplemented [] 2025-12-04T09:58:54.5714717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5714819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5715405Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5715442Z graph_break [] 2025-12-04T09:58:54.5715515Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5715556Z Autotune Choices Stats: 2025-12-04T09:58:54.5716404Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.5716547Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5716663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5716826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5717438Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5718039Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.5718659Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5719267Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5719873Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5720496Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5721109Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5721719Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5722324Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5722946Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5723076Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.5723117Z Autotune Choices Stats: 2025-12-04T09:58:54.5723882Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5724103Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5724281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5724558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5725189Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5725832Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5726506Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5727159Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5730090Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5730733Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5731390Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5732041Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5732670Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5733298Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5733441Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.5733521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5733565Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5733603Z unimplemented [] 2025-12-04T09:58:54.5733666Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5733783Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5734361Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5734399Z graph_break [] 2025-12-04T09:58:54.5734475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5734517Z Autotune Choices Stats: 2025-12-04T09:58:54.5735270Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.5735399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5735515Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5735686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5736358Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5736961Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5737570Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5738200Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5738805Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5739409Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5740027Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5740646Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5741246Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5741848Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5741988Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.5742029Z Autotune Choices Stats: 2025-12-04T09:58:54.5742801Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.5743019Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5743185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5743463Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5744122Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5744756Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5745381Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5746054Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5746705Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5747334Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5747958Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5748610Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5749251Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5749875Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5750005Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.5750082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5750124Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5750162Z unimplemented [] 2025-12-04T09:58:54.5750225Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5750338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5750917Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5750955Z graph_break [] 2025-12-04T09:58:54.5751031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5751071Z Autotune Choices Stats: 2025-12-04T09:58:54.5751805Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.5751935Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5752056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5752235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5752841Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5753451Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5754058Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5754661Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5755281Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5755883Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5756529Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5757146Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5757762Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5758373Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5758502Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.5758544Z Autotune Choices Stats: 2025-12-04T09:58:54.5759303Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5759559Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5759725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5760002Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5760640Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5761274Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5761909Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5762540Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5763170Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5763815Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5764445Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5765073Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5765706Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5766370Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:54.5766500Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices
2025-12-04T09:58:54.5766592Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.5766641Z Traceback (most recent call last):
2025-12-04T09:58:54.5766796Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.5766838Z self.assertTrue(
2025-12-04T09:58:54.5766943Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.5766992Z raise self.failureException(msg)
2025-12-04T09:58:54.5767119Z AssertionError: False is not true : Log file /tmp/tmp3xyvfcnn/flex_attention_configs.json was not created
2025-12-04T09:58:54.5767123Z
2025-12-04T09:58:54.5767199Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.5767388Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.5767391Z
2025-12-04T09:58:54.5767482Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.5767559Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.5767602Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.5767639Z unimplemented []
2025-12-04T09:58:54.5767716Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.5768296Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), 
('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.5768397Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5768434Z graph_break [] 2025-12-04T09:58:54.5768508Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5769004Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.5769052Z current_size = base.storage().size() 2025-12-04T09:58:54.5769094Z Autotune Choices Stats: 2025-12-04T09:58:54.5769850Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.5769991Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5770108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5770269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5770880Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5771484Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5772105Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5772706Z triton_flex_attention_7 
0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5773308Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5773914Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5774528Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5775129Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5775728Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5776396Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5776528Z SingleProcess AUTOTUNE 
benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.5776568Z Autotune Choices Stats: 2025-12-04T09:58:54.5777342Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.5777562Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5777741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5778018Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5778666Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5779288Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5779912Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5780555Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5781178Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5781801Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5782436Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5783069Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5783692Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5784321Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5784462Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.5784537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5784580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5784617Z unimplemented [] 2025-12-04T09:58:54.5784679Z stats [('calls_captured', 
105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5784791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5785370Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5785408Z graph_break [] 2025-12-04T09:58:54.5785481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5785523Z Autotune Choices Stats: 2025-12-04T09:58:54.5786313Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.5786441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5786555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5786728Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5787338Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5787948Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5788547Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5789180Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5789786Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5790383Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5790991Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5791606Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5792206Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5792803Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5792942Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.5792983Z Autotune Choices Stats: 
2025-12-04T09:58:54.5793747Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.5793966Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5794131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5794408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5795050Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5795683Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5796349Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5796973Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5797630Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5798262Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5798880Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5799521Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5800161Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5800783Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5800911Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.5800986Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5801027Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5801064Z unimplemented [] 2025-12-04T09:58:54.5801125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5801238Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 
2), ('ok', 2)] 2025-12-04T09:58:54.5801826Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5801864Z graph_break [] 2025-12-04T09:58:54.5801938Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5801978Z Autotune Choices Stats: 2025-12-04T09:58:54.5802717Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.5802848Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5802964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5803131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5803747Z triton_flex_attention_99 0.0104 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5804356Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5804956Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5805562Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5806237Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5806834Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5807441Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5808061Z 
triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5808679Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5809278Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5809407Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.5809447Z Autotune Choices Stats: 2025-12-04T09:58:54.5810203Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.5810444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5810611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5810890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5811524Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5812160Z 
triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5812793Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5813416Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5814048Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5814700Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5815329Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5815988Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5816645Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5817283Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5817411Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.5817485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5817528Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5817565Z unimplemented [] 2025-12-04T09:58:54.5817624Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5817725Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5818312Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5818362Z graph_break [] 2025-12-04T09:58:54.5818438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5818480Z Autotune Choices Stats: 2025-12-04T09:58:54.5819236Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.5819363Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5819476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5819637Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5820265Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5820861Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5821478Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5822081Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5822683Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5823313Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5823910Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5824525Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5825116Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5825737Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5825866Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.5825908Z Autotune Choices Stats: 2025-12-04T09:58:54.5826712Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.5826947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5827111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5827406Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5828043Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5828666Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5829303Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5829945Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5830576Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5831203Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5831844Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5832472Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5833105Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5833724Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5833864Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.5833938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5833980Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5834017Z unimplemented [] 2025-12-04T09:58:54.5834080Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5834180Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5834755Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5834791Z graph_break [] 2025-12-04T09:58:54.5834865Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5834904Z Autotune Choices Stats: 2025-12-04T09:58:54.5835645Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.5835794Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5835906Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5836099Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5836714Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5837355Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5837957Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5838569Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5839169Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5839773Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5840412Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5841013Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5841625Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5842226Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5842365Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.5842404Z Autotune Choices Stats: 2025-12-04T09:58:54.5843173Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.5843391Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5843557Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5843929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5844570Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5845201Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5845837Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5846504Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5847158Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5847782Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5848406Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5849060Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5849686Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5850325Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5850454Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.5850527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5850580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5850616Z unimplemented [] 2025-12-04T09:58:54.5850677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5850776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5851358Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5851395Z graph_break [] 2025-12-04T09:58:54.5851471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5851511Z Autotune Choices Stats: 2025-12-04T09:58:54.5852251Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.5852392Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5852505Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5852670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5853304Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5853909Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5854523Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5855125Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.5855740Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5856383Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5856980Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5857621Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5858227Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5858852Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5858982Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.5859023Z Autotune Choices Stats: 2025-12-04T09:58:54.5859798Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.5860017Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5860181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5860459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5861087Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5861735Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5862359Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5862995Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5863617Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5864254Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5864876Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5865503Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5866205Z triton_flex_attention_backward_266 0.0228 ms 67.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5866831Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5866963Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.5867036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5867081Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5867118Z unimplemented [] 2025-12-04T09:58:54.5867201Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5867300Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5867884Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5867935Z graph_break [] 2025-12-04T09:58:54.5868008Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5868050Z Autotune Choices Stats: 2025-12-04T09:58:54.5868795Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.5868927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5869040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5869203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5869837Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5870441Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5871052Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5871666Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5872275Z triton_flex_attention_283 0.0131 ms 89.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5872894Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5873497Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5874098Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5874718Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5875325Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5875454Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.5875494Z Autotune Choices Stats: 2025-12-04T09:58:54.5876312Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.5876542Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5876705Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5876986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5877622Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5878238Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5878887Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5879511Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5880155Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5880783Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5881414Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5882043Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5882671Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5883317Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5883447Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.5883522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5883566Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5883602Z unimplemented [] 2025-12-04T09:58:54.5883666Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5883766Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5884366Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5884407Z graph_break [] 2025-12-04T09:58:54.5884479Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5884535Z Autotune Choices Stats: 2025-12-04T09:58:54.5885272Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.5885402Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5885518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5885681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5886342Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5886978Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5887579Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5888189Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5888800Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5889405Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5890010Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5890610Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5891219Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5891831Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5891961Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.5892001Z Autotune Choices Stats: 2025-12-04T09:58:54.5892770Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.5892986Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5893149Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5893438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5894074Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5894699Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5895322Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5896026Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5896646Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5897293Z 
triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5897926Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5898567Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5899191Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5899826Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5899956Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.5900032Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5900074Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5900112Z unimplemented [] 2025-12-04T09:58:54.5900172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5900277Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5900851Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5900889Z graph_break [] 2025-12-04T09:58:54.5900963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5901005Z Autotune Choices Stats: 2025-12-04T09:58:54.5901746Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.5901886Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5902003Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5902165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5902778Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5903381Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5904010Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5904611Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5905227Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5905840Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5906488Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5907084Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5907687Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5908315Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5908444Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.5908485Z Autotune Choices Stats: 2025-12-04T09:58:54.5909239Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.5909458Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5909640Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5909914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5910547Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5911195Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5911817Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5912465Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5913091Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5913713Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5914347Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5914979Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5915602Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5916273Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5916418Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:54.5916491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5916533Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5916571Z unimplemented [] 2025-12-04T09:58:54.5916630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5916743Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5917317Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5917357Z graph_break [] 
2025-12-04T09:58:54.5917429Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5917469Z Autotune Choices Stats: 2025-12-04T09:58:54.5918232Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.5918360Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5918473Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5918658Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5919265Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5919869Z triton_flex_attention_418 0.0101 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5920470Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5921087Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5921687Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5922287Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5922899Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5923507Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5924107Z 
triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5924714Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5924855Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.5924898Z Autotune Choices Stats: 2025-12-04T09:58:54.5925665Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.5925883Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5926086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5926365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5927019Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5927656Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.5928276Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5928905Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5929555Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5930185Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5930807Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5931449Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5932081Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5932707Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5932834Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.5932910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5932953Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5932993Z unimplemented [] 2025-12-04T09:58:54.5933054Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5933167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5933754Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5933793Z graph_break [] 2025-12-04T09:58:54.5933868Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.5933907Z Autotune Choices Stats: 2025-12-04T09:58:54.5934648Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.5934776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5934890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5935068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5935678Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5936345Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5936945Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5937541Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5938168Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5938771Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5939376Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5939991Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5940602Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5941207Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5941336Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.5941379Z Autotune Choices Stats: 2025-12-04T09:58:54.5942134Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.5942373Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5942540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5942817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5943452Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5944093Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5944729Z triton_flex_attention_backward_492 0.0187 ms 
82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5945348Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5946012Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5946668Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5947284Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5947910Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5948545Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5949188Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5949316Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.5949388Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5949434Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5949471Z unimplemented [] 2025-12-04T09:58:54.5949534Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5949636Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5950218Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5950265Z graph_break [] 2025-12-04T09:58:54.5950339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5950379Z Autotune Choices Stats: 
2025-12-04T09:58:54.5951129Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.5951258Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5951374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5951537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5952168Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5952772Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5953393Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5953993Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5954590Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5955214Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5955818Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5956457Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5957066Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5957676Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5957808Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.5957849Z Autotune Choices Stats: 2025-12-04T09:58:54.5958607Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.5958839Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5959005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5959296Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5959931Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5960562Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5961197Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5961833Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5962457Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5963085Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5963724Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5964353Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5964981Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5965616Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5965754Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.5965828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5965872Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5965909Z unimplemented [] 2025-12-04T09:58:54.5966010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5966110Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5966690Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.5966728Z graph_break [] 2025-12-04T09:58:54.5966804Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5966844Z Autotune Choices Stats: 2025-12-04T09:58:54.5967581Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.5967759Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5967872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5968035Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5968645Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5969266Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5969867Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5970481Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5971081Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.5971683Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5972313Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5972920Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5973534Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5974137Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5974276Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:54.5974317Z Autotune Choices Stats: 2025-12-04T09:58:54.5975084Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.5975302Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.5975467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5975755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5976434Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5977063Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5977684Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5978320Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5978955Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5979584Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5980204Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5980859Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5981483Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5982113Z triton_flex_attention_backward_579 0.0230 ms 68.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5982244Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.5982319Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5982380Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5982418Z unimplemented [] 2025-12-04T09:58:54.5982482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5982582Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5983161Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5983203Z graph_break [] 2025-12-04T09:58:54.5983277Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5983321Z Autotune Choices Stats: 2025-12-04T09:58:54.5984072Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.5984214Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5984327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5984491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5985117Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5985717Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5986377Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5986975Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5987592Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5988192Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5988800Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5989436Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5990036Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5990652Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5990783Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.5990823Z Autotune Choices Stats: 2025-12-04T09:58:54.5991581Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.5991812Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.5991976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.5992255Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.5992890Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5993524Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5994143Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5994777Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5995404Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5996080Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.5996709Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5997344Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.5998003Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5998625Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.5998758Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.5998833Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.5998877Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.5998914Z unimplemented [] 2025-12-04T09:58:54.5998976Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.5999095Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.5999674Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.5999724Z graph_break [] 2025-12-04T09:58:54.5999798Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.5999840Z Autotune Choices Stats: 2025-12-04T09:58:54.6000577Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.6000709Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6000824Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6000984Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6001600Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6002222Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.6002827Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6003443Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6004048Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6004658Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6005260Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6005873Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6006536Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6007138Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6007269Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.6007310Z Autotune Choices Stats: 2025-12-04T09:58:54.6008075Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.6008307Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6008470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6008750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6009379Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6010007Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6010657Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6011273Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6011912Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6012540Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6013168Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6013797Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6014423Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6015084Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6015215Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.6015291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6015333Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6015373Z unimplemented [] 2025-12-04T09:58:54.6015434Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6015537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6016174Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6016214Z graph_break [] 2025-12-04T09:58:54.6016289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6016331Z Autotune Choices Stats: 2025-12-04T09:58:54.6017086Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.6017214Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6017329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6017488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6018095Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6018720Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6019318Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6019922Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6020545Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6021150Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6021768Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6022371Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6022971Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6023598Z triton_flex_attention_708 0.0163 ms 61.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6023728Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.6023770Z Autotune Choices Stats: 2025-12-04T09:58:54.6024525Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.6024755Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6024925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6025221Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6025851Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6026506Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6027121Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6027778Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6028410Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6029046Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6029669Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6030309Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6030933Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6031558Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6031714Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.6031792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6031835Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6031873Z unimplemented [] 2025-12-04T09:58:54.6031934Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6032043Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6032621Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6032662Z graph_break [] 2025-12-04T09:58:54.6032738Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6032780Z Autotune Choices Stats: 2025-12-04T09:58:54.6033536Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.6033675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6033790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6033951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6034568Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6035175Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6035794Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6036439Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6037044Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6037671Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6038287Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6038890Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6039492Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6040109Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6040253Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.6040294Z Autotune Choices Stats: 2025-12-04T09:58:54.6041055Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.6041273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6041437Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6041723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6042355Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6042992Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6043620Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6044928Z triton_flex_attention_backward_769 0.0188 ms 83.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6046305Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6047600Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6048899Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6050190Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6051497Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6052778Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6053599Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.6053839Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6053998Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6054112Z unimplemented [] 2025-12-04T09:58:54.6054247Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6054456Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6055161Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6055796Z graph_break [] 2025-12-04T09:58:54.6055962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6056113Z Autotune Choices Stats: 2025-12-04T09:58:54.6056934Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.6057829Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6058106Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6058427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6059246Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6060490Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6061717Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6062989Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6064240Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6065468Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6066746Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6068028Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6069276Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6070504Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6071355Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.6071613Z Autotune Choices Stats: 2025-12-04T09:58:54.6072463Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.6073473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6073891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6074374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:54.6075336Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6076650Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6077952Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6079227Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6080534Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6081820Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6083107Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6084416Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6085713Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6087050Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6087846Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.6088085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6088241Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6088351Z unimplemented [] 2025-12-04T09:58:54.6088469Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6088664Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6089418Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6090059Z graph_break [] 2025-12-04T09:58:54.6090187Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6090339Z Autotune Choices Stats: 2025-12-04T09:58:54.6091144Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 
2025-12-04T09:58:54.6092046Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6092324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6092631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6093455Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6094715Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6096011Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6097259Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6098542Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6099785Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6101025Z triton_flex_attention_831 0.0140 ms 62.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6102278Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6103529Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6104756Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6105527Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.6105731Z Autotune Choices Stats: 2025-12-04T09:58:54.6106604Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.6107636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6108052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6108529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6109469Z triton_flex_attention_backward_869 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6110773Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6112071Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6113347Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6114638Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6115990Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6117273Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6118556Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6119862Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6121157Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6121948Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 
choices 2025-12-04T09:58:54.6122189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6122347Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6122462Z unimplemented [] 2025-12-04T09:58:54.6122580Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6122773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6123487Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6124140Z graph_break [] 2025-12-04T09:58:54.6124271Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6124422Z Autotune Choices Stats: 2025-12-04T09:58:54.6125234Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.6126169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.6126444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6126753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6127590Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6128840Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6130090Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6131331Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6132567Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6133850Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6135097Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6136372Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6137620Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6138877Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6139646Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.6139848Z Autotune Choices Stats: 2025-12-04T09:58:54.6140669Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.6141683Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6142125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6142620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6143571Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6144858Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6146202Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6147497Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6148782Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6150065Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6151383Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6152673Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6153956Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6155254Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6156083Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.6156326Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.6156485Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6156594Z unimplemented [] 2025-12-04T09:58:54.6156709Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6156907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6157620Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6158260Z graph_break [] 2025-12-04T09:58:54.6158391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6158541Z Autotune Choices Stats: 2025-12-04T09:58:54.6159342Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.6160277Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6160551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6160860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6161666Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6162921Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6164168Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6165424Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6166692Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6167930Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6169201Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6170439Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6171701Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6172946Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6173724Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.6173930Z Autotune Choices Stats: 2025-12-04T09:58:54.6174752Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.6175755Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6176208Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6176684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6177664Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6178952Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6180232Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6181545Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6182849Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6184137Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6185430Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6186785Z triton_flex_attention_backward_960 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6188064Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6189357Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6190146Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.6190384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6190536Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.6190659Z unimplemented [] 2025-12-04T09:58:54.6190775Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6190969Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6191682Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6192323Z graph_break [] 2025-12-04T09:58:54.6192452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6192601Z Autotune Choices Stats: 2025-12-04T09:58:54.6193395Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.6194295Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6194583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6194890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6195720Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6197001Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6198256Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6199491Z triton_flex_attention_973 0.0123 ms 72.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6200749Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6201992Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6203233Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6204506Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6205745Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6207025Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6207791Z SingleProcess AUTOTUNE 
benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.6207996Z Autotune Choices Stats: 2025-12-04T09:58:54.6208812Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.6209836Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6210253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6214461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6215422Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6216819Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6218113Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6219424Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6220712Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6222032Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6223326Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6224603Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6225912Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6227243Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6228033Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.6228275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6228434Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6228544Z unimplemented [] 
2025-12-04T09:58:54.6228664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6228882Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6229595Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6230247Z graph_break [] 2025-12-04T09:58:54.6230379Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6230530Z Autotune Choices Stats: 2025-12-04T09:58:54.6231340Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.6232236Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6232513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6232825Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6233644Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6234912Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6236195Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6237457Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6238698Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6239957Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6241195Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6242441Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6243708Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6244946Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6245713Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.6245915Z Autotune Choices Stats: 2025-12-04T09:58:54.6246797Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.6247807Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6248222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6248698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6249637Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6250931Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6252253Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6253529Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6254830Z 
triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6256161Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6257462Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6258744Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6260032Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6261351Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6262145Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.6262385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6262538Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6262647Z unimplemented [] 2025-12-04T09:58:54.6262763Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.6262964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6263690Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6264325Z graph_break [] 2025-12-04T09:58:54.6264452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6264605Z Autotune Choices Stats: 2025-12-04T09:58:54.6265419Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.6266358Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6266633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6266942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6267752Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6269022Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6270266Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6271506Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6272753Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6274001Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6275260Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6276539Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6277776Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6279050Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6279814Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.6280016Z Autotune Choices Stats: 
2025-12-04T09:58:54.6280834Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.6281850Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6282265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6282753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6283705Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6284995Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6286310Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6287628Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6288922Z triton_flex_attention_backward_1101 0.0201 ms 78.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6290212Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6291493Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6292792Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6294085Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6295368Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6296232Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.6296471Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6296626Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6296733Z unimplemented [] 2025-12-04T09:58:54.6296849Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6297043Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6297755Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6298395Z graph_break [] 2025-12-04T09:58:54.6298524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6298673Z Autotune Choices Stats: 2025-12-04T09:58:54.6299485Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.6300399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6300673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6300980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.6301785Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6303023Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6304291Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6305528Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6306809Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6308071Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6309321Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6310561Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6311807Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6313068Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6313837Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.6314039Z Autotune Choices Stats: 2025-12-04T09:58:54.6314862Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.6315864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6316317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6316805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6317757Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6319065Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6320354Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6321661Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6322973Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6324262Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6325557Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6326901Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6328188Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6329475Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6330282Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.6330520Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6330674Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6330780Z unimplemented [] 2025-12-04T09:58:54.6330895Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6331104Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6331810Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6332441Z graph_break [] 2025-12-04T09:58:54.6332567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6332715Z Autotune Choices Stats: 2025-12-04T09:58:54.6333525Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.6334421Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6334693Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6335026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6335842Z 
triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6337129Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6338357Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6339642Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6340887Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6342118Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6343373Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6344622Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6345868Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6347132Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6347919Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.6348120Z Autotune Choices Stats: 2025-12-04T09:58:54.6348960Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.6349961Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6350373Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6350846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6351810Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.6353104Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6354378Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6355653Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6356999Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6358291Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6359576Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6360874Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6362166Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6363453Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6364243Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.6364479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6364632Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6364741Z unimplemented [] 2025-12-04T09:58:54.6364858Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6365070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6365795Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6366469Z graph_break [] 2025-12-04T09:58:54.6366596Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6366747Z Autotune Choices Stats: 2025-12-04T09:58:54.6367548Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.6368448Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6368719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6369042Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6369848Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6371114Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6372369Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6373612Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6374884Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6376167Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6377406Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6378682Z triton_flex_attention_1201 0.0150 ms 67.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6379930Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6381169Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6381940Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.6382141Z Autotune Choices Stats: 2025-12-04T09:58:54.6382955Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.6383999Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6384418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6384895Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6385848Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6387194Z 
triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6388490Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6389767Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6391052Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6392371Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6393659Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6394947Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6396293Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6397597Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6398388Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.6398627Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6398782Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6398889Z unimplemented [] 2025-12-04T09:58:54.6399006Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6399200Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6399913Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6400572Z graph_break [] 2025-12-04T09:58:54.6400700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6400855Z Autotune Choices Stats: 2025-12-04T09:58:54.6401669Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.6402568Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6402841Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6403153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6403968Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6405209Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6406499Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6407743Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6408977Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6410256Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6411503Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6412777Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6414018Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6415283Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6416084Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.6416290Z Autotune Choices Stats: 2025-12-04T09:58:54.6417120Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.6418157Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6418582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6419089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6420039Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6421342Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6422632Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6423941Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6425235Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6426583Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6427893Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6429187Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6430488Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6431773Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6432574Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.6432809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6432967Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6433085Z unimplemented [] 2025-12-04T09:58:54.6433209Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6433405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6434128Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6434774Z graph_break [] 2025-12-04T09:58:54.6434911Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6435066Z Autotune Choices Stats: 2025-12-04T09:58:54.6435879Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.6436852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6437133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6437442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6438259Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6439522Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6440764Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6442018Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6443261Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6444510Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6445779Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6447053Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6448315Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6449555Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6450347Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.6450550Z Autotune Choices Stats: 2025-12-04T09:58:54.6451383Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.6452391Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6452811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6453317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6454287Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6455581Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6456907Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6458202Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6459512Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6460808Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6462088Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6463400Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6464691Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6466035Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6466827Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.6467071Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6467247Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6467362Z unimplemented [] 2025-12-04T09:58:54.6467489Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6467692Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6468413Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6469054Z graph_break [] 2025-12-04T09:58:54.6469184Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6469347Z Autotune Choices Stats: 2025-12-04T09:58:54.6470162Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.6471075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6471360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6471674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6472310Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6472920Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6473538Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6474138Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.6474761Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6475374Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6476026Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6476668Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6477272Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6477895Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6478023Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.6478065Z Autotune Choices Stats: 2025-12-04T09:58:54.6478846Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.6479063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6479229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6479506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6480142Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6480796Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6481425Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6482059Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6482687Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6483330Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6483954Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6484579Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6485232Z 
triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6485864Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6486034Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.6486108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6486152Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6486191Z unimplemented [] 2025-12-04T09:58:54.6486278Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6486378Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6486952Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6487007Z graph_break [] 2025-12-04T09:58:54.6487082Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6487122Z Autotune Choices Stats: 2025-12-04T09:58:54.6487862Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.6487992Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6488107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6488272Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6488922Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6489520Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6490124Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6490747Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6491358Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6491962Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6492571Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6493182Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6493801Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6494403Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6494535Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.6494576Z Autotune Choices Stats: 2025-12-04T09:58:54.6495363Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.6495593Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6495761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6496059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6496691Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6497319Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6497979Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6498608Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6499250Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6499890Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6500530Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6501161Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6501796Z 
triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6502436Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6502568Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.6502644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6502691Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6502729Z unimplemented [] 2025-12-04T09:58:54.6502790Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6502890Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6503475Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6503514Z graph_break [] 2025-12-04T09:58:54.6503589Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6503640Z Autotune Choices Stats: 2025-12-04T09:58:54.6504394Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.6504521Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6504634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6504798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6505412Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6506077Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6506685Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6507293Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.6507921Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6508542Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6509159Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6509764Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6510398Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6511003Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6511134Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.6511176Z Autotune Choices Stats: 2025-12-04T09:58:54.6511951Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.6512167Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6512331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6512621Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6513262Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6513900Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6514546Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6515172Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6515805Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6516493Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6517129Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6517756Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6518390Z 
triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6519041Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6519173Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.6519267Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.6519315Z Traceback (most recent call last): 2025-12-04T09:58:54.6519472Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.6519514Z self.assertTrue( 2025-12-04T09:58:54.6519620Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.6519671Z raise self.failureException(msg) 2025-12-04T09:58:54.6519799Z AssertionError: False is not true : Log file /tmp/tmprqbilwpb/flex_attention_configs.json was not created 2025-12-04T09:58:54.6519803Z 
2025-12-04T09:58:54.6519880Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.6520045Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.6520049Z
2025-12-04T09:58:54.6520139Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.6520225Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.6520270Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.6520309Z unimplemented []
2025-12-04T09:58:54.6520369Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.6520942Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:54.6521053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:54.6521091Z graph_break []
2025-12-04T09:58:54.6521166Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:54.6521654Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.6521703Z current_size = base.storage().size() 2025-12-04T09:58:54.6521746Z Autotune Choices Stats: 2025-12-04T09:58:54.6522498Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.6522637Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6522756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6522926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6523537Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6524147Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6524755Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6525371Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6526008Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6526616Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6527238Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6527849Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6528445Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6529058Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6529189Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.6529244Z Autotune Choices Stats: 2025-12-04T09:58:54.6529995Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.6530216Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6530384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6530662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6531294Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6531939Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6532562Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6533195Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6533821Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6534458Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6535080Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6535715Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6536399Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6537027Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6537156Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.6537231Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6537286Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6537325Z unimplemented [] 2025-12-04T09:58:54.6537387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6537488Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6538063Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6538115Z graph_break [] 2025-12-04T09:58:54.6538190Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.6538230Z Autotune Choices Stats: 2025-12-04T09:58:54.6538970Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.6539102Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6539218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6539379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6540014Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6540616Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6541215Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6541838Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6542453Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6543051Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6543653Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6544280Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6544887Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6545488Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6545617Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.6545659Z Autotune Choices Stats: 2025-12-04T09:58:54.6546488Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.6546717Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6546887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6547164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6547798Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6548421Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6549075Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6549693Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6550323Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6550949Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6551584Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6552207Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6552847Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6553480Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6553610Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.6553687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6553730Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6553769Z unimplemented [] 2025-12-04T09:58:54.6553832Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6553933Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6554528Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6554567Z graph_break [] 2025-12-04T09:58:54.6554644Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6554694Z Autotune Choices Stats: 2025-12-04T09:58:54.6555426Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.6555553Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6555670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6555837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6556489Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6557130Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6557731Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6558331Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6558958Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.6559574Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6560182Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6560784Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6561407Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6562011Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6562140Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.6562182Z Autotune Choices Stats: 2025-12-04T09:58:54.6562953Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.6563168Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.6563333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6563623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6564261Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6564882Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6565521Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6566197Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6566823Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6567461Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6568100Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6568726Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6569359Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6570010Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6570139Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.6570213Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6570258Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6570297Z unimplemented [] 2025-12-04T09:58:54.6570363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6570461Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6571040Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6571077Z graph_break [] 2025-12-04T09:58:54.6571153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6571192Z Autotune Choices Stats: 2025-12-04T09:58:54.6571945Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.6572083Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6572198Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6572357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6572975Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6573579Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6574212Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6574816Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6575419Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6576074Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6576688Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6577294Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6577894Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6578516Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6578649Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.6578689Z Autotune Choices Stats: 2025-12-04T09:58:54.6579457Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.6579675Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6579853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6580135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6580776Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6581402Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6582025Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6582674Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6583305Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6583933Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6584564Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6585199Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6585826Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6586479Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6586625Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.6586700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6586745Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6586781Z unimplemented [] 2025-12-04T09:58:54.6586843Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6586955Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6587535Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6587572Z graph_break [] 2025-12-04T09:58:54.6587647Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6587688Z Autotune Choices Stats: 2025-12-04T09:58:54.6588465Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.6588603Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6588716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6588898Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6589508Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6590116Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.6590720Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6591340Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6591950Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6592556Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6593168Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6593783Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6594382Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6594989Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6595127Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.6595167Z Autotune Choices Stats: 2025-12-04T09:58:54.6595966Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.6596186Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6596349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6596630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6597294Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6597932Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6598557Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6599175Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6599827Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6600458Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6601086Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6601725Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6602365Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6602987Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6603117Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.6603293Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6603337Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6603374Z unimplemented [] 2025-12-04T09:58:54.6603458Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6603558Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6604144Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6604182Z graph_break [] 2025-12-04T09:58:54.6604256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6604296Z Autotune Choices Stats: 2025-12-04T09:58:54.6605039Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.6605168Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6605280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6605463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6606108Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6606728Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6607338Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6607943Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6608570Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6609170Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6609775Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6610386Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6611000Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6611611Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6611742Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.6611783Z Autotune Choices Stats: 2025-12-04T09:58:54.6612535Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.6612775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6612941Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6613215Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6613852Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6614484Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6615115Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6615746Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6616408Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6617064Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6617685Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6618318Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6618959Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6619593Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6619724Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.6619799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6619841Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6619880Z unimplemented [] 2025-12-04T09:58:54.6619942Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6620043Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6620619Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6620666Z graph_break [] 2025-12-04T09:58:54.6620739Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6620782Z Autotune Choices Stats: 2025-12-04T09:58:54.6621535Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.6621666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6621781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6621941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6622560Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6623166Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6623776Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6624380Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6624982Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6625607Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6626251Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6626868Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6627469Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6628088Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6628218Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.6628261Z Autotune Choices Stats: 2025-12-04T09:58:54.6629018Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.6629248Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6629412Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6629702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6630336Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6630965Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6631597Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6632233Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6632862Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6633490Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6634131Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6634758Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6635406Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6636068Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6636208Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.6636285Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6636326Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6636364Z unimplemented [] 2025-12-04T09:58:54.6636424Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6636526Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6637102Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6637139Z graph_break [] 2025-12-04T09:58:54.6637213Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6637255Z Autotune Choices Stats: 2025-12-04T09:58:54.6638004Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.6638169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6638284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6638444Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6639061Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6639678Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6640284Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6640901Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6641508Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6642110Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6642734Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6643338Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6643951Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6644553Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6644692Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.6644734Z Autotune Choices Stats: 2025-12-04T09:58:54.6645508Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.6645727Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6645894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6646218Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.6646875Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6647505Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6648146Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6648766Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6649413Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6650056Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6650679Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6651331Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6651958Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6652595Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6652724Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices
2025-12-04T09:58:54.6652803Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.6652861Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.6652900Z unimplemented []
2025-12-04T09:58:54.6652962Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.6653065Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:54.6653644Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)]
2025-12-04T09:58:54.6653682Z graph_break []
2025-12-04T09:58:54.6653760Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:54.6653802Z Autotune Choices Stats:
2025-12-04T09:58:54.6654558Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0}
2025-12-04T09:58:54.6654694Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6654809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6654970Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6655593Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6656240Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6656865Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6657464Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6658082Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6658691Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6659294Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6659921Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6660530Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6661144Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6661273Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.6661316Z Autotune Choices Stats: 2025-12-04T09:58:54.6662081Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.6662297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6662467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6662746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6663385Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6664033Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6664656Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6665295Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6665964Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6666601Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6667229Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6667858Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6668513Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6669143Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6669270Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.6669348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6669389Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6669426Z unimplemented [] 2025-12-04T09:58:54.6669504Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6669608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6670186Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6670231Z graph_break [] 2025-12-04T09:58:54.6670306Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6670345Z Autotune Choices Stats: 2025-12-04T09:58:54.6671087Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.6671220Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.6671335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6671498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6672135Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6672734Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6673340Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6673958Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6674553Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6675167Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6675776Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6676421Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6677064Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6677667Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6677798Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.6677839Z Autotune Choices Stats: 2025-12-04T09:58:54.6678613Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.6678845Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6679015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6679293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6679930Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6680556Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6681204Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6681831Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6682470Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6683099Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6683731Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6684356Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6684978Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6685637Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6685767Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.6685842Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.6685886Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6685960Z unimplemented [] 2025-12-04T09:58:54.6686022Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6686122Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6686717Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6686756Z graph_break [] 2025-12-04T09:58:54.6686832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6686886Z Autotune Choices Stats: 2025-12-04T09:58:54.6687636Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.6687769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6687881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6688044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6688655Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6689286Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6689890Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.6690491Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6691109Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6691715Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6692329Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6692935Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6693543Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6694155Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6694286Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.6694326Z Autotune Choices Stats: 2025-12-04T09:58:54.6695100Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.6695321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6695486Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6695777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6696453Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6697084Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6697707Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6698356Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6698982Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6699626Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6700250Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6700888Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6701519Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6702172Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6702302Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.6702377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6702423Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.6702459Z unimplemented [] 2025-12-04T09:58:54.6702522Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6702623Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6703197Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6703236Z graph_break [] 2025-12-04T09:58:54.6703310Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6703351Z Autotune Choices Stats: 2025-12-04T09:58:54.6704103Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.6704245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6704358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6704523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6705141Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6705747Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6706419Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6707023Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6707624Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6708245Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6708858Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6709465Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6710076Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6710702Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6710833Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.6710872Z Autotune Choices Stats: 2025-12-04T09:58:54.6711631Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.6711851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6712026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6712306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6712946Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6713578Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6714199Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6714845Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6715477Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6716207Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6716841Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6717489Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6718118Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6718740Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6718880Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.6718955Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6718999Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6719037Z unimplemented [] 
2025-12-04T09:58:54.6719099Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6719214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6719802Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6719842Z graph_break [] 2025-12-04T09:58:54.6719915Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6719956Z Autotune Choices Stats: 2025-12-04T09:58:54.6720712Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.6720843Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6720957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6721129Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6721744Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6722348Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6722957Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6723577Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6724178Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6724786Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6725399Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6726050Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6726654Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6727257Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6727401Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.6727442Z Autotune Choices Stats: 2025-12-04T09:58:54.6728208Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.6728425Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6728589Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6728870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6729523Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6730159Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6730784Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6731409Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6732064Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6732691Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6733315Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6733967Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6734611Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6735240Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6735368Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.6735445Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6735486Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6738127Z unimplemented [] 2025-12-04T09:58:54.6738199Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.6738332Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6738921Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6738962Z graph_break [] 2025-12-04T09:58:54.6739038Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6739079Z Autotune Choices Stats: 2025-12-04T09:58:54.6739828Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.6739964Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6740081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6740264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6740875Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6741499Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6742105Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6742704Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6743335Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6743944Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6744543Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6745153Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6745757Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6746407Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6746538Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.6746580Z Autotune Choices Stats: 
2025-12-04T09:58:54.6747337Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.6747589Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6747756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6748030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6748667Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6749307Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6749943Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6750572Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6751202Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6751855Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6752476Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6753108Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6753741Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6754377Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6754505Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.6754581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6754626Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6754664Z unimplemented [] 2025-12-04T09:58:54.6754726Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6754827Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6755413Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6755466Z graph_break [] 2025-12-04T09:58:54.6755539Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6755581Z Autotune Choices Stats: 2025-12-04T09:58:54.6756374Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.6756503Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6756618Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6756779Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.6757400Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6758003Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6758615Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6759217Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6759827Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6760447Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6761045Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6761656Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6762258Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6762871Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6763001Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.6763042Z Autotune Choices Stats: 2025-12-04T09:58:54.6763794Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.6764022Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6764189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6764476Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6765107Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.6765731Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6766412Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6767048Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6767675Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6768298Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6768953Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6769585Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6770224Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6770843Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6770981Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.6771059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6771100Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6771139Z unimplemented [] 2025-12-04T09:58:54.6771200Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6771304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6771889Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6771929Z graph_break [] 2025-12-04T09:58:54.6772003Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6772042Z Autotune Choices Stats: 2025-12-04T09:58:54.6772784Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.6772935Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6773051Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6773214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6773825Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6774439Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6775046Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6775655Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6776299Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6776906Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6777537Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6778136Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6778754Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6779364Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6779504Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.6779545Z Autotune Choices Stats: 2025-12-04T09:58:54.6780297Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.6780517Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6780685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6780974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6781616Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6782239Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6782877Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6783500Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6784146Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6784773Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6785396Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6786089Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6786715Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6787356Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6787483Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.6787557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6787611Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6787651Z unimplemented [] 2025-12-04T09:58:54.6787712Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6787813Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6788386Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6788422Z graph_break [] 2025-12-04T09:58:54.6788496Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6788537Z Autotune Choices Stats: 2025-12-04T09:58:54.6789292Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.6789431Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6789546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6789707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6790330Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6790932Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6791550Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6792152Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6792768Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6793378Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6793982Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6794601Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6795204Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6795816Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6795982Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.6796022Z Autotune Choices Stats: 2025-12-04T09:58:54.6796784Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.6797014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6797178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6797461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6798094Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6798746Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6799359Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6800007Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6800631Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6801272Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6801896Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6802520Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6803171Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6803796Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6803927Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.6804001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6804043Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6804079Z unimplemented [] 2025-12-04T09:58:54.6804149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6804250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6804819Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6804868Z graph_break [] 2025-12-04T09:58:54.6804941Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6804980Z Autotune Choices Stats: 2025-12-04T09:58:54.6805721Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.6805852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6805996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6806157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6806782Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6807401Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6808007Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6808632Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6809232Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6809845Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6810453Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6811054Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6811687Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6812295Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6812425Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.6812464Z Autotune Choices Stats: 2025-12-04T09:58:54.6813235Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.6813462Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6813629Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6813904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6814531Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6815156Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6815815Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6816493Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6817147Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6817768Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6818401Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6819033Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.6819653Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6820322Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6820452Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.6820526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6820570Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6820606Z unimplemented [] 2025-12-04T09:58:54.6820668Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6820767Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6821352Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6821390Z graph_break [] 2025-12-04T09:58:54.6821463Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6821503Z Autotune Choices Stats: 2025-12-04T09:58:54.6822247Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.6822375Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6822487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6822653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6823264Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6823888Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6824496Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6825094Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.6825706Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6826349Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6826994Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6827598Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6829283Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6829930Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6830069Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.6830110Z Autotune Choices Stats: 2025-12-04T09:58:54.6830864Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.6831095Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6831262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6831541Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6832176Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6832802Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6833423Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6834120Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6834859Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6835494Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6836152Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6836781Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6837407Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6838079Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6838220Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.6838298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6838340Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6838378Z unimplemented [] 2025-12-04T09:58:54.6838441Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6838546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6839125Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6839161Z graph_break [] 2025-12-04T09:58:54.6839236Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6839275Z Autotune Choices Stats: 2025-12-04T09:58:54.6840029Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.6840157Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6840272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6840432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6841040Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6841645Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6842278Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6842875Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6843482Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6844098Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6844700Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6845305Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6845906Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6846576Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6846705Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.6846745Z Autotune Choices Stats: 2025-12-04T09:58:54.6847505Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.6847723Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6847888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6848184Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6848810Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6849434Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6850063Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6850708Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6851345Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6851972Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6852604Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6853234Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6853857Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6854489Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6854639Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.6854713Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6854755Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6854791Z unimplemented [] 2025-12-04T09:58:54.6854852Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6854960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6855541Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6855580Z graph_break [] 2025-12-04T09:58:54.6855653Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6855692Z Autotune Choices Stats: 2025-12-04T09:58:54.6856488Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.6856619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6856732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6856893Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6857502Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6858099Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6858709Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6859343Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6859938Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6860536Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6861148Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6861750Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6862352Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6862959Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6863109Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.6863149Z Autotune Choices Stats: 2025-12-04T09:58:54.6863919Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.6864136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6864301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6864578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6865231Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6865847Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6866502Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6867131Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6867814Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6868431Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6869057Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6869698Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6870322Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6870946Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6871076Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.6871153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6871195Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6871233Z unimplemented [] 2025-12-04T09:58:54.6871295Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6871421Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6872008Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6872047Z graph_break [] 2025-12-04T09:58:54.6872119Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6872159Z Autotune Choices Stats: 2025-12-04T09:58:54.6872898Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.6873027Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6873142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6873303Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6873931Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6874528Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6875129Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6875739Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6876402Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6877004Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6877609Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6878225Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6878832Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6879434Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6879566Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.6879607Z Autotune Choices Stats: 2025-12-04T09:58:54.6880373Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.6880637Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6880804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6881085Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6881723Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6882359Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6882983Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6883676Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6884308Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6884974Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6885593Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6886264Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6886908Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6887529Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6887656Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.6887733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6887774Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6887812Z unimplemented [] 2025-12-04T09:58:54.6887872Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6887974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6888545Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6888613Z graph_break [] 
2025-12-04T09:58:54.6888685Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6888727Z Autotune Choices Stats: 2025-12-04T09:58:54.6889494Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.6889623Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6889740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6889901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6890525Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6891128Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6891729Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6892331Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6892932Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6893572Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6894175Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6894780Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6895395Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6896054Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6896182Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.6896226Z Autotune Choices Stats: 2025-12-04T09:58:54.6896984Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.6897236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6897406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6897697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6898335Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6898964Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.6899600Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6900225Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6900851Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6901476Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6902131Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6902764Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6903401Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6904032Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6904160Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.6904234Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6904276Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6904313Z unimplemented [] 2025-12-04T09:58:54.6904375Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6904476Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6905050Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6905087Z graph_break [] 2025-12-04T09:58:54.6905161Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.6905201Z Autotune Choices Stats: 2025-12-04T09:58:54.6905992Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.6906155Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6906270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6906430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6907045Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6907663Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6908267Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6908860Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6909466Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6910068Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6910702Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6911307Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6911925Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6912533Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6912662Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.6912704Z Autotune Choices Stats: 2025-12-04T09:58:54.6913461Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.6913677Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6913843Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6914138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6914788Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6915419Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6916114Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6916736Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6917365Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6917989Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6918612Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6919273Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6919900Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6920539Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6920668Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.6920743Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6920784Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6920823Z unimplemented [] 2025-12-04T09:58:54.6920883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6920984Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6921560Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6921598Z graph_break [] 2025-12-04T09:58:54.6921672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6921712Z Autotune Choices Stats: 
2025-12-04T09:58:54.6922456Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.6922603Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6922716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6922877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6923491Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6924099Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6924709Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6925319Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6925965Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6926568Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6927170Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6927819Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6928424Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6929038Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6929169Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.6929209Z Autotune Choices Stats: 2025-12-04T09:58:54.6929980Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.6930197Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6930363Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6930644Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6931279Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6931938Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6932561Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6933200Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6933832Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6934462Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6935086Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6935710Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6936413Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.6937038Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6937166Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.6937238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6937282Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6937319Z unimplemented [] 2025-12-04T09:58:54.6937393Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6937494Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6938069Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6938107Z graph_break [] 2025-12-04T09:58:54.6938182Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6938222Z Autotune Choices Stats: 2025-12-04T09:58:54.6938969Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.6939102Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6939216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6939376Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6940023Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6940626Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6941228Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6941849Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6942452Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.6943060Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6943668Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6944271Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6944907Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6945514Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6945642Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.6945681Z Autotune Choices Stats: 2025-12-04T09:58:54.6946496Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.6946715Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.6946881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6947162Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6947794Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6948427Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6949113Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6949741Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6950377Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6951006Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6951637Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6952263Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6952896Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6953553Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6953685Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.6953758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6953802Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6953840Z unimplemented [] 2025-12-04T09:58:54.6953904Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6954005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6954594Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.6954634Z graph_break [] 2025-12-04T09:58:54.6954708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6954749Z Autotune Choices Stats: 2025-12-04T09:58:54.6955484Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.6955612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6955724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6955887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6956545Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6957189Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6957802Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6958405Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6959025Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6959626Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6960227Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6960830Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6961468Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6962075Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6962209Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.6962249Z Autotune Choices Stats: 2025-12-04T09:58:54.6963021Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.6963239Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6963402Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6963685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6964317Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6964945Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6965568Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6966257Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6966882Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6967521Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6968140Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6968768Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6969401Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6970062Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6970195Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.6970268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6970312Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6970348Z unimplemented [] 2025-12-04T09:58:54.6970411Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6970512Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6971096Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6971133Z graph_break [] 2025-12-04T09:58:54.6971207Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6971248Z Autotune Choices Stats: 2025-12-04T09:58:54.6972005Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.6972134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6972247Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6972409Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6973023Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6973626Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6974259Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6974855Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6975464Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6976125Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6976733Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6977338Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6977947Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6978579Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6978708Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.6978748Z Autotune Choices Stats: 2025-12-04T09:58:54.6979513Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.6979734Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6979913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6980192Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6980825Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6981450Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6982082Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6982735Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6983357Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6983993Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6984630Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6985259Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6985887Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6986564Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6986717Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.6986792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.6986833Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.6986872Z unimplemented [] 2025-12-04T09:58:54.6986932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.6987043Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.6987620Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.6987658Z graph_break [] 2025-12-04T09:58:54.6987731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.6987771Z Autotune Choices Stats: 2025-12-04T09:58:54.6988526Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.6988657Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6988770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6988932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6989545Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6990158Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.6990765Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6991404Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6992005Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.6992613Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6993238Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6993844Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6994447Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6995053Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6995204Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.6995247Z Autotune Choices Stats: 2025-12-04T09:58:54.6996065Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.6996286Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.6996450Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.6996727Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.6997377Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6998002Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6998622Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6999247Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.6999914Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7000539Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7001169Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7001814Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7002438Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7003062Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7003189Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.7003264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7003306Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7003345Z unimplemented [] 2025-12-04T09:58:54.7003433Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7003536Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7004129Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7004169Z graph_break [] 2025-12-04T09:58:54.7004245Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7004284Z Autotune Choices Stats: 2025-12-04T09:58:54.7005022Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.7005148Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7005263Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7005435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7006084Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7006684Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7007281Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7007886Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7008553Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7009155Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7009773Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7010385Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7010987Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7011590Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7011717Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.7011757Z Autotune Choices Stats: 2025-12-04T09:58:54.7012533Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.7012769Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7012936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7013213Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7013846Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7014485Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7015117Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7015739Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7016401Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7017067Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7017692Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7018336Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7018964Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7019596Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7019726Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.7019801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7019846Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7019884Z unimplemented [] 2025-12-04T09:58:54.7019946Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7020047Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7020623Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7020682Z graph_break [] 2025-12-04T09:58:54.7020758Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7020797Z Autotune Choices Stats: 2025-12-04T09:58:54.7021554Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.7021682Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7021796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7021961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7022575Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7023184Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7023794Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7024394Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7024997Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7025639Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7026286Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7026908Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7027513Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7028122Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7028253Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.7028293Z Autotune Choices Stats: 2025-12-04T09:58:54.7029047Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7029288Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7029466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7029745Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7030378Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7031021Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7031650Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7032272Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7032912Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7033542Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7034199Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7034827Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7035468Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7036142Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7036272Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.7036345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7036390Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7036429Z unimplemented [] 2025-12-04T09:58:54.7036489Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7036587Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7037173Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7037211Z graph_break [] 2025-12-04T09:58:54.7037285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7037324Z Autotune Choices Stats: 2025-12-04T09:58:54.7038075Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.7038231Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7038349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7038512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7039127Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7039760Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7040359Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7040963Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7041572Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7042178Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7042813Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7043419Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7044031Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7044634Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7044764Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.7044804Z Autotune Choices Stats: 2025-12-04T09:58:54.7045564Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.7045785Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7045986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7046290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7046938Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7047563Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7048201Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7048826Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7049452Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7050075Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7050692Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7051353Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7051977Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7052616Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7052746Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.7052822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7052865Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7052901Z unimplemented [] 2025-12-04T09:58:54.7052962Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7053062Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7053637Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7053677Z graph_break [] 2025-12-04T09:58:54.7053752Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7053793Z Autotune Choices Stats: 2025-12-04T09:58:54.7054534Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:54.7054687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7054802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7054985Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7055606Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7056258Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7056887Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7057491Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7058100Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7058705Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7059338Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7059970Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7060577Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7061195Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7061328Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:54.7061369Z Autotune Choices Stats: 2025-12-04T09:58:54.7062131Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7062353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7062521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7062800Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:54.7063436Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7064096Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7064726Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7065371Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7066043Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7066672Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7067298Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7067931Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7068603Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7069227Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7069356Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices
2025-12-04T09:58:54.7069450Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.7069515Z Traceback (most recent call last):
2025-12-04T09:58:54.7069671Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.7069715Z self.assertTrue(
2025-12-04T09:58:54.7069821Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.7069871Z raise self.failureException(msg)
2025-12-04T09:58:54.7069998Z AssertionError: False is not true : Log file /tmp/tmpfcr2ai80/flex_attention_configs.json was not created
2025-12-04T09:58:54.7070005Z
2025-12-04T09:58:54.7070082Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.7070251Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.7070253Z
2025-12-04T09:58:54.7070344Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.7070425Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.7070471Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.7070512Z unimplemented []
2025-12-04T09:58:54.7070574Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.7071166Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), 
('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.7071268Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7071309Z graph_break [] 2025-12-04T09:58:54.7071385Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7071908Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.7071961Z current_size = base.storage().size() 2025-12-04T09:58:54.7072003Z Autotune Choices Stats: 2025-12-04T09:58:54.7072757Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.7072887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7073003Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7073165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7073788Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7074402Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7075005Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7075608Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7076247Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7076886Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7077487Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7078112Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7078713Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7079320Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7079483Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.7079543Z Autotune Choices Stats: 
2025-12-04T09:58:54.7080331Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.7080620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7080844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7081124Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7081760Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7082398Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7083030Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7083695Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7084326Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7084962Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7085625Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7086284Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7086927Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7087560Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7087694Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.7087774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7087823Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7087862Z unimplemented [] 2025-12-04T09:58:54.7087927Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7088027Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 
2), ('ok', 2)] 2025-12-04T09:58:54.7088616Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7088656Z graph_break [] 2025-12-04T09:58:54.7088731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7088775Z Autotune Choices Stats: 2025-12-04T09:58:54.7089566Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.7089697Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7089817Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7089983Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7090602Z triton_flex_attention_50 0.0102 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7091219Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7091826Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7092451Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7093067Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7093669Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7094321Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7094930Z triton_flex_attention_66 
0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7095543Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7096191Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7096323Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.7096363Z Autotune Choices Stats: 2025-12-04T09:58:54.7097125Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.7097345Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7097511Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7097824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7098474Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7099098Z 
triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7099734Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7100359Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7100990Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7101615Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7102238Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7102904Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7103523Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7104157Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7104291Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.7104371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7104414Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7104453Z unimplemented [] 2025-12-04T09:58:54.7104516Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7104618Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7105197Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7105235Z graph_break [] 2025-12-04T09:58:54.7105309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7105351Z Autotune Choices Stats: 2025-12-04T09:58:54.7106140Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.7106313Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7106430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7106607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7107227Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7107830Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7108446Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7109052Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7109662Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7110266Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7110886Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7111504Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7112182Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7112797Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7112930Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.7112975Z Autotune Choices Stats: 2025-12-04T09:58:54.7113745Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.7113965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7114131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7114407Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7115042Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7115704Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7116372Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7117010Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7117636Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7118269Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7118890Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7119518Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7120193Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7120816Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7120945Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.7121022Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7121075Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7121116Z unimplemented [] 2025-12-04T09:58:54.7121177Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7121279Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7121860Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7121900Z graph_break [] 2025-12-04T09:58:54.7121974Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7122016Z Autotune Choices Stats: 2025-12-04T09:58:54.7122764Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.7122892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7123005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7123165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7123813Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7124417Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7125019Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7125636Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7126408Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7127015Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7127614Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7128256Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7128857Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7129465Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7129593Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.7129635Z Autotune Choices Stats: 2025-12-04T09:58:54.7130411Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.7130629Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7130796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7131075Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7131723Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7132354Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7133003Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7133628Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7134265Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7134891Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7135513Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7136181Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7136865Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7137486Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7137615Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.7137692Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7137734Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7137772Z unimplemented [] 2025-12-04T09:58:54.7137834Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7137937Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7138524Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7138564Z graph_break [] 2025-12-04T09:58:54.7138641Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7138682Z Autotune Choices Stats: 2025-12-04T09:58:54.7139429Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.7139556Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7139672Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7139836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7140447Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7141085Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7141687Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7142291Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.7142901Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7143498Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7144108Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7144710Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7145342Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7145995Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7146127Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.7146170Z Autotune Choices Stats: 2025-12-04T09:58:54.7146949Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.7147164Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7147330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7147606Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7148244Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7148872Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7149530Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7150160Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7150790Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7151427Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7152052Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7152676Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7153307Z triton_flex_attention_backward_220 0.0227 ms 69.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7153968Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7154098Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.7154175Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7154221Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7154262Z unimplemented [] 2025-12-04T09:58:54.7154323Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7154426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7155006Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7155043Z graph_break [] 2025-12-04T09:58:54.7155118Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7155159Z Autotune Choices Stats: 2025-12-04T09:58:54.7155918Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.7156083Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7156198Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7156364Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7156977Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7157590Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7158235Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7158839Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7159445Z triton_flex_attention_244 0.0136 ms 71.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7160068Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7160673Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7161273Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7161874Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7162508Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7162639Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.7162679Z Autotune Choices Stats: 2025-12-04T09:58:54.7163436Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.7163653Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7163827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7164105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7164737Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7165365Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7166026Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7166685Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7167314Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7167943Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7168580Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7169216Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7169843Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7170470Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7170621Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.7170694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7170737Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7170773Z unimplemented [] 2025-12-04T09:58:54.7170846Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7170946Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7171519Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7171557Z graph_break [] 2025-12-04T09:58:54.7171633Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7171673Z Autotune Choices Stats: 2025-12-04T09:58:54.7172424Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.7172552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7172665Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7172827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7173444Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7174050Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7174656Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7175290Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7175895Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7176528Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7177146Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7177755Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7178360Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7178964Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7179121Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.7179162Z Autotune Choices Stats: 2025-12-04T09:58:54.7179934Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.7180152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7180325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7180602Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7181242Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7181868Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7182496Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7183120Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7183776Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7184396Z triton_flex_attention_backward_319 
0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7185024Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7185663Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7186320Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7186954Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7187082Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.7187155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7187201Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7187267Z unimplemented [] 2025-12-04T09:58:54.7187331Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7187431Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7188023Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 
2025-12-04T09:58:54.7188059Z graph_break [] 2025-12-04T09:58:54.7188136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7188179Z Autotune Choices Stats: 2025-12-04T09:58:54.7188921Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.7189053Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7189167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7189341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7189957Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7190560Z 
triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7191163Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7191766Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7192399Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7193003Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7193617Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7194223Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7194827Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7195438Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7195568Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.7195609Z Autotune Choices Stats: 2025-12-04T09:58:54.7196426Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", 
"best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7196668Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7196834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7197111Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7197739Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7198374Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7198997Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7199625Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7200250Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7200906Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7201532Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7202175Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7202796Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7203423Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7203555Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.7203631Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7203676Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7203714Z unimplemented [] 2025-12-04T09:58:54.7203778Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7203878Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7204451Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7204510Z graph_break [] 2025-12-04T09:58:54.7204584Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.7204627Z Autotune Choices Stats: 2025-12-04T09:58:54.7205376Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.7205511Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7205626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7205788Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7206454Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7207058Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7207662Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7208269Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7208874Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7209509Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7210111Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7210729Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7211329Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7211932Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7212063Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.7212103Z Autotune Choices Stats: 2025-12-04T09:58:54.7212863Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.7213099Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7213273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7213551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7214184Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7214822Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7215447Z triton_flex_attention_backward_400 0.0187 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7216097Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7216729Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7217352Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7218021Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7218647Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7219289Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7219916Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7220045Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:54.7220123Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7220164Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7220203Z unimplemented [] 2025-12-04T09:58:54.7220265Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7220367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7220941Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7220979Z graph_break [] 2025-12-04T09:58:54.7221052Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7221093Z Autotune Choices Stats: 
2025-12-04T09:58:54.7221852Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.7221999Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7222114Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7222274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7222882Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7223498Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7224099Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7224707Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7225311Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7225918Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7226602Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7227204Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7227818Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7228423Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7228554Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.7228595Z Autotune Choices Stats: 2025-12-04T09:58:54.7229360Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.7229578Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7229744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7230040Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7230684Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7231310Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7231947Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7232575Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7233203Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7233834Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7234459Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7235117Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7235746Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7236425Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7236553Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.7236629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7236671Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7236709Z unimplemented [] 2025-12-04T09:58:54.7236770Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7236871Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7237448Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7237488Z graph_break [] 2025-12-04T09:58:54.7237561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7242729Z Autotune Choices Stats: 2025-12-04T09:58:54.7243498Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.7243683Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7243801Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7243982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7244597Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7245206Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7245815Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7246468Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7247069Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7247674Z 
triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7248278Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7248927Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7249533Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7250145Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7250278Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.7250321Z Autotune Choices Stats: 2025-12-04T09:58:54.7251081Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.7251302Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.7251472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7251751Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7252392Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7253061Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7253683Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7254317Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7254944Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7255572Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7256232Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7256860Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7257536Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7258153Z triton_flex_attention_backward_487 0.0228 ms 67.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7258281Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.7258360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7258404Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7258457Z unimplemented [] 2025-12-04T09:58:54.7258521Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7258626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7259202Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7259242Z graph_break [] 2025-12-04T09:58:54.7259319Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7259359Z Autotune Choices Stats: 2025-12-04T09:58:54.7260101Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.7260230Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7260348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7260513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7261159Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7261764Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7262363Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7262970Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7263571Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7264185Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7264790Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7265416Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7266074Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7266677Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7266806Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.7266849Z Autotune Choices Stats: 2025-12-04T09:58:54.7267611Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.7267831Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7267998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7268281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7268914Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7269541Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7270206Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7270831Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7271468Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7272094Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7272717Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7273342Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7273984Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7274618Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7274748Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.7274825Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7274867Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7274906Z unimplemented [] 2025-12-04T09:58:54.7274968Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7275069Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7275656Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7275695Z graph_break [] 2025-12-04T09:58:54.7275770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7275809Z Autotune Choices Stats: 2025-12-04T09:58:54.7276593Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.7276723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7276840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7277002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7277621Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7278270Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.7278869Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7279478Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7280090Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7280695Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7281297Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7281900Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7282529Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7283130Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7283259Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:54.7283300Z Autotune Choices Stats: 2025-12-04T09:58:54.7284072Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.7284290Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7284456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7284734Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7285365Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7286028Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7286691Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7287326Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7287953Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7288595Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7289216Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7289843Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7290474Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7291125Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7291256Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.7291332Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7291375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7291413Z unimplemented [] 2025-12-04T09:58:54.7291477Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7291578Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7292158Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7292194Z graph_break [] 2025-12-04T09:58:54.7292270Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7292311Z Autotune Choices Stats: 2025-12-04T09:58:54.7293068Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.7293197Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7293311Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7293473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7294086Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7294690Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7295331Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7295968Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7296572Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7297200Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7297808Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7298410Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7299009Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7299675Z triton_flex_attention_616 0.0163 ms 67.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7299804Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.7299844Z Autotune Choices Stats: 2025-12-04T09:58:54.7300607Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.7300824Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7301009Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7301285Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7301907Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7302529Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7303156Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7303809Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7304433Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7305061Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7305689Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7306364Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7306989Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7307609Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7307775Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.7307849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7307891Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7307928Z unimplemented [] 2025-12-04T09:58:54.7307990Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7308104Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7308682Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7308720Z graph_break [] 2025-12-04T09:58:54.7308795Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7308835Z Autotune Choices Stats: 2025-12-04T09:58:54.7309597Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.7309726Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7309839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7310001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7310610Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7311220Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7311832Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7312471Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7313073Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7313677Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7314290Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7314893Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7315493Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7316139Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7316304Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.7316345Z Autotune Choices Stats: 2025-12-04T09:58:54.7317130Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7317350Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7317514Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7317794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7318440Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7319064Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7319687Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7320308Z triton_flex_attention_backward_676 0.0188 ms 82.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7320965Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7321590Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7322218Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7322857Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7323484Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7324108Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7324236Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.7324310Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7324354Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7324390Z unimplemented [] 2025-12-04T09:58:54.7324473Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7324573Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7325175Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7325212Z graph_break [] 2025-12-04T09:58:54.7325285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7325326Z Autotune Choices Stats: 2025-12-04T09:58:54.7326106Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.7326235Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7326349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7326524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7327140Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7327739Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7328345Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7328943Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7329584Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7330181Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7330787Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7331403Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7332002Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7332600Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7332731Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.7332771Z Autotune Choices Stats: 2025-12-04T09:58:54.7333529Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.7333788Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7333954Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7334237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.7334871Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7335503Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7336170Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7336786Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7337414Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7338090Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7338714Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7339348Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7339989Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7340605Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7340735Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.7340810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7340852Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7340890Z unimplemented [] 2025-12-04T09:58:54.7340949Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7341051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7341625Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7341689Z graph_break [] 2025-12-04T09:58:54.7341763Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7341804Z Autotune Choices Stats: 2025-12-04T09:58:54.7342564Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.7342692Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7342810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7342971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7343603Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7344205Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7344808Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7345412Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7346041Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7346706Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7347310Z triton_flex_attention_756 0.0143 ms 72.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7347937Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7348538Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7349136Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7349266Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.7349306Z Autotune Choices Stats: 2025-12-04T09:58:54.7350064Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.7350314Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7350478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7350767Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7351400Z triton_flex_attention_backward_777 0.0157 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7352027Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7352648Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7353275Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7353900Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7354525Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7355185Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7355807Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7356486Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7357109Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7357236Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 
choices 2025-12-04T09:58:54.7357311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7357352Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7357390Z unimplemented [] 2025-12-04T09:58:54.7357451Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7357552Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7358130Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7358168Z graph_break [] 2025-12-04T09:58:54.7358240Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7358282Z Autotune Choices Stats: 2025-12-04T09:58:54.7359014Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.7359196Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.7359309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7359469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7360086Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7360701Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7361303Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7361904Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7362512Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7363110Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7363745Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7364350Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7364974Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7365576Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7365703Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.7365743Z Autotune Choices Stats: 2025-12-04T09:58:54.7366531Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7366751Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7366916Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7367222Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7367874Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7368501Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7369138Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7369763Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7370388Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7371014Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7371642Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7372307Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7372931Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7373567Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7373695Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.7373771Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.7373813Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7373851Z unimplemented [] 2025-12-04T09:58:54.7373913Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7374014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7374587Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7374625Z graph_break [] 2025-12-04T09:58:54.7374699Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7374738Z Autotune Choices Stats: 2025-12-04T09:58:54.7375479Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.7375625Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7375740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7375900Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7376580Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7377190Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7377809Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7378418Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7379019Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7379621Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7380225Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7380871Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7381469Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7382078Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7382208Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.7382248Z Autotune Choices Stats: 2025-12-04T09:58:54.7383010Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7383228Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7383392Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7383670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7384301Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7384966Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7385589Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7386272Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7386902Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7387529Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7388152Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7388779Z triton_flex_attention_backward_873 0.0221 ms 70.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7389442Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7390067Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7390195Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.7390268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7390312Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.7390348Z unimplemented [] 2025-12-04T09:58:54.7390420Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7390520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7391101Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7391137Z graph_break [] 2025-12-04T09:58:54.7391211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7391251Z Autotune Choices Stats: 2025-12-04T09:58:54.7391989Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.7392120Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7392232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7392395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7393030Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7393632Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7394243Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7394869Z triton_flex_attention_879 0.0113 ms 85.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7395466Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7396118Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7396720Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7397320Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7397990Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7398593Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7398723Z SingleProcess AUTOTUNE 
benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.7398763Z Autotune Choices Stats: 2025-12-04T09:58:54.7399549Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.7399767Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7399932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7400209Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7400847Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7401473Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7402133Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7402756Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7403397Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7404024Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7404643Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7405279Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7405907Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7406617Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7406747Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.7406821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7406863Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7406899Z unimplemented [] 
2025-12-04T09:58:54.7406961Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7407060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7407658Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7407695Z graph_break [] 2025-12-04T09:58:54.7407770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7407809Z Autotune Choices Stats: 2025-12-04T09:58:54.7408546Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.7408674Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7408786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7408946Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7409561Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7410210Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7410814Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7411412Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7412023Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7412629Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7413231Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7413834Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7414470Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7415086Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7415214Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.7415254Z Autotune Choices Stats: 2025-12-04T09:58:54.7416058Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.7416274Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7416443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7416721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7417347Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7417971Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7418596Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7419259Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7419885Z 
triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7420519Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7421154Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7421776Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7422403Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7423069Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7423197Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.7423271Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7423314Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7423351Z unimplemented [] 2025-12-04T09:58:54.7423412Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.7423515Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7424091Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7424128Z graph_break [] 2025-12-04T09:58:54.7424200Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7424240Z Autotune Choices Stats: 2025-12-04T09:58:54.7424990Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.7425119Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7425232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7425393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7426051Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7426649Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7427304Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7427905Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7428506Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7429121Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7429730Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7430332Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7430931Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7431568Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7431695Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.7431735Z Autotune Choices Stats: 
2025-12-04T09:58:54.7432509Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.7432727Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7432900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7433177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7433805Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7434434Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7435053Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7435703Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7436379Z triton_flex_attention_backward_1008 0.0202 ms 77.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7437013Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7437653Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7438277Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7438910Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7439534Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7439686Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.7439758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7439802Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7439838Z unimplemented [] 2025-12-04T09:58:54.7439901Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7440012Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7440584Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7440622Z graph_break [] 2025-12-04T09:58:54.7440697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7440739Z Autotune Choices Stats: 2025-12-04T09:58:54.7441503Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.7441634Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7441750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7441909Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.7442521Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7443126Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7443736Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7444370Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7444967Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7445570Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7446231Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7446834Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7447435Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7448035Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7448190Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.7448232Z Autotune Choices Stats: 2025-12-04T09:58:54.7449005Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.7449223Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7449386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7449663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7450309Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7450938Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7451563Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7452190Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7452847Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7453472Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7454099Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7454740Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7455370Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7456037Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7456164Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.7456240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7456282Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7456319Z unimplemented [] 2025-12-04T09:58:54.7456380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7456513Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7457100Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7457139Z graph_break [] 2025-12-04T09:58:54.7457214Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7457256Z Autotune Choices Stats: 2025-12-04T09:58:54.7457987Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.7458116Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7458229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7458402Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7459011Z 
triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7459612Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7460231Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7460833Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7461473Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7462076Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7462682Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.7463294Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7463896Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7464500Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7464629Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.7464670Z Autotune Choices Stats: 2025-12-04T09:58:54.7465434Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.7465679Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7465845Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7466160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7466792Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7467441Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7468066Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7468692Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7469313Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7469980Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7470600Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7471232Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7471877Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7472502Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7472630Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.7472705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7472750Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7472789Z unimplemented [] 2025-12-04T09:58:54.7472849Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7472950Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:54.7473526Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7473589Z graph_break [] 2025-12-04T09:58:54.7473666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7473706Z Autotune Choices Stats: 2025-12-04T09:58:54.7474459Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.7474588Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7474702Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7474865Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7475488Z triton_flex_attention_1110 0.0087 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7476141Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7476742Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7477350Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7477959Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7478610Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7479210Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7479826Z 
triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7480432Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7481040Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7481168Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.7481208Z Autotune Choices Stats: 2025-12-04T09:58:54.7481968Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.7482205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7482371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7482662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7483303Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.7483930Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7484565Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7485194Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7485826Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7486485Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7487144Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7487769Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7488409Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7489036Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7489167Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.7489246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7489289Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7489330Z unimplemented [] 2025-12-04T09:58:54.7489392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7489493Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7490063Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7490101Z graph_break [] 2025-12-04T09:58:54.7490176Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7490216Z Autotune Choices Stats: 2025-12-04T09:58:54.7490960Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.7491105Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7491218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7491381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7491987Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7492610Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7493215Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7493826Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7494433Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7495036Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7495673Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7496318Z triton_flex_attention_1170 0.0139 ms 80.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7496936Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7497540Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7497670Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.7497711Z Autotune Choices Stats: 2025-12-04T09:58:54.7498482Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.7498700Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7498866Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7499170Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7499818Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7500444Z 
triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7501083Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7501712Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7502340Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7502971Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7503599Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7504261Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7504889Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7505525Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7505658Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.7505733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7505777Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7505813Z unimplemented [] 2025-12-04T09:58:54.7505876Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7506009Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7506587Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7506625Z graph_break [] 2025-12-04T09:58:54.7506700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7506741Z Autotune Choices Stats: 2025-12-04T09:58:54.7507489Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.7507646Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7507761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7507945Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7508549Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7509160Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7509784Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7510391Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7510991Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7511596Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7512209Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7512849Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7513450Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7514066Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7514199Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.7514240Z Autotune Choices Stats: 2025-12-04T09:58:54.7515013Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.7515232Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7515397Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7515675Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7516336Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7516997Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7517623Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7518264Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7518893Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7519525Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7520147Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7520776Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7521433Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7522053Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7522185Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.7522258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7522302Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7522351Z unimplemented [] 2025-12-04T09:58:54.7522415Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7522515Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7523085Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7523122Z graph_break [] 2025-12-04T09:58:54.7523199Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7523239Z Autotune Choices Stats: 2025-12-04T09:58:54.7523988Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.7524117Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7524230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7524395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7525038Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7525643Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7526290Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7526910Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7527511Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7528118Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7528728Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7529390Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7529991Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7530595Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7530723Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.7530763Z Autotune Choices Stats: 2025-12-04T09:58:54.7531539Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7531759Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7531924Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7532202Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7532829Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7533455Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7534109Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7534745Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7535391Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7536055Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7536682Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7537312Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7537980Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7538609Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7538740Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.7538813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7538855Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7538891Z unimplemented [] 2025-12-04T09:58:54.7538954Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7539053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7539641Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7539681Z graph_break [] 2025-12-04T09:58:54.7539755Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7539798Z Autotune Choices Stats: 2025-12-04T09:58:54.7540551Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.7540681Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7540796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7540956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7541569Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7542208Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7542813Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7543414Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7544027Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7544636Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7545243Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7545842Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7546527Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7547129Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7547259Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.7547300Z Autotune Choices Stats: 2025-12-04T09:58:54.7548076Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7548294Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7548458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7548733Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7549373Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7549998Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7550652Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7551282Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7551914Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7552551Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7553182Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7553807Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7554435Z 
triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7555086Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7555215Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.7555292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7555334Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7555374Z unimplemented [] 2025-12-04T09:58:54.7555433Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7555533Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7556157Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7556195Z graph_break [] 2025-12-04T09:58:54.7556268Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7556308Z Autotune Choices Stats: 2025-12-04T09:58:54.7557068Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.7557195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7557309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7557474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7558083Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7558690Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7559418Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7560026Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7560633Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7561244Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7561853Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7562459Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7563060Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7563694Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7563823Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.7563864Z Autotune Choices Stats: 2025-12-04T09:58:54.7564630Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.7564848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7565025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7565301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7566019Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7566648Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7567281Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7567942Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7568571Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7569199Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7569840Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7570472Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7571100Z 
triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7571729Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7571887Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.7571963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7572006Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7572047Z unimplemented [] 2025-12-04T09:58:54.7572125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7572225Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7572794Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7572836Z graph_break [] 2025-12-04T09:58:54.7572910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7572952Z Autotune Choices Stats: 2025-12-04T09:58:54.7573702Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.7573830Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7573948Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7574110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7574728Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7575335Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7575976Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7576627Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7577230Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7577845Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7578458Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7579066Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7579671Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7580276Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7580426Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.7580467Z Autotune Choices Stats: 2025-12-04T09:58:54.7581237Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7581453Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7581619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7581899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7582547Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7583175Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7583803Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7584432Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7585093Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7585720Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7586402Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7587033Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7587658Z 
triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7588283Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7588410Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.7588487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7588563Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7588603Z unimplemented [] 2025-12-04T09:58:54.7588664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7588767Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7589360Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7589397Z graph_break [] 2025-12-04T09:58:54.7589472Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7589513Z Autotune Choices Stats: 2025-12-04T09:58:54.7590254Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.7590380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7590503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7590667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7591284Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7591884Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7592490Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7593091Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7593730Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7594335Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7594941Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7595548Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7596178Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7596785Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7596913Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.7596956Z Autotune Choices Stats: 2025-12-04T09:58:54.7597739Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.7597980Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7598148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7598427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7599070Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7599698Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7600323Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7600946Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7601571Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7602234Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7602855Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7603492Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7604123Z 
triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7604754Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7604885Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.7604958Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7605002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7605041Z unimplemented [] 2025-12-04T09:58:54.7605104Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7605203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7605784Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7605852Z graph_break [] 2025-12-04T09:58:54.7605970Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7606011Z Autotune Choices Stats: 2025-12-04T09:58:54.7606764Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:54.7606895Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7607011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7607175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7607800Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7608407Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7609013Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7609620Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7610222Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7610864Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7611471Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7612088Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7612693Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7613295Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7613427Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:54.7613467Z Autotune Choices Stats: 2025-12-04T09:58:54.7614236Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7614471Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7614646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7614924Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7615558Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7616241Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7616865Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7617488Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7618119Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7618754Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7619414Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7620044Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7620684Z 
triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7621312Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7621440Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:54.7621515Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7621560Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7621596Z unimplemented [] 2025-12-04T09:58:54.7621657Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7621755Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7622330Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7622367Z graph_break [] 2025-12-04T09:58:54.7622443Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7622504Z Autotune Choices Stats: 2025-12-04T09:58:54.7623265Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 2025-12-04T09:58:54.7623393Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7623507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7623669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7624284Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7624913Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7625518Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7626157Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7626760Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7627401Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7628017Z triton_flex_attention_1540 0.0131 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7628623Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7629240Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7629847Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7629977Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:54.7630019Z Autotune Choices Stats: 2025-12-04T09:58:54.7630780Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7630995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7631159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7631463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7632107Z triton_flex_attention_backward_1559 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7632736Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7633371Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7634001Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7634627Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7635255Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7635883Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7636589Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7637215Z 
triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7637861Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7637992Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 seconds precompiling for 22 choices 2025-12-04T09:58:54.7638085Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.7638138Z Traceback (most recent call last): 2025-12-04T09:58:54.7638293Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.7638335Z self.assertTrue( 2025-12-04T09:58:54.7638439Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.7638489Z raise self.failureException(msg) 2025-12-04T09:58:54.7638617Z AssertionError: False is not true : Log file /tmp/tmpjf4cuoyj/flex_attention_configs.json was not created 2025-12-04T09:58:54.7638624Z 
2025-12-04T09:58:54.7638699Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.7638868Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.7638870Z 2025-12-04T09:58:54.7638960Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.7639039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7639083Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7639124Z unimplemented [] 2025-12-04T09:58:54.7639185Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7639770Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.7639899Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7639939Z graph_break [] 2025-12-04T09:58:54.7640026Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7640518Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.7640570Z current_size = base.storage().size() 2025-12-04T09:58:54.7640612Z Autotune Choices Stats: 2025-12-04T09:58:54.7641352Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.7641481Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7641608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7641773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7642390Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7642991Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7643594Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7644195Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7644838Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7645438Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7646082Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7646687Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7647289Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7647893Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7648023Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.7648094Z Autotune Choices Stats: 2025-12-04T09:58:54.7648869Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.7649089Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7649258Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7649538Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7650184Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7650810Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7651432Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7652058Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7652688Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7653349Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7653969Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7654606Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7655235Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7655858Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7656024Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.7656102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7656147Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7656185Z unimplemented [] 2025-12-04T09:58:54.7656246Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7656348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7656922Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7656989Z graph_break [] 2025-12-04T09:58:54.7657065Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.7657106Z Autotune Choices Stats: 2025-12-04T09:58:54.7657862Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.7657997Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7658119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7658280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7658914Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7659523Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7660123Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7660723Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7661333Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7661963Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7662566Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7663175Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7663781Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7664381Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7664511Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.7664555Z Autotune Choices Stats: 2025-12-04T09:58:54.7665312Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.7665549Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7665730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7666046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7666679Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7667317Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7667944Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7668562Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7669189Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7669813Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7670484Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7671108Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7671740Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7672370Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7672499Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.7672574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7672623Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7672661Z unimplemented [] 2025-12-04T09:58:54.7672722Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7672821Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7673397Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7673433Z graph_break [] 2025-12-04T09:58:54.7673511Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7673575Z Autotune Choices Stats: 2025-12-04T09:58:54.7674325Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.7674460Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7674573Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7674737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7675350Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7675999Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7676606Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7677207Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7677816Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.7678437Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7679062Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7679663Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7680272Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7680878Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7681010Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.7681050Z Autotune Choices Stats: 2025-12-04T09:58:54.7681810Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.7682024Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.7682189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7682490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7683131Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7683757Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7684398Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7685022Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7685651Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7686310Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7686942Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7687608Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7688234Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7688873Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7689005Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.7689078Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7689121Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7689160Z unimplemented [] 2025-12-04T09:58:54.7689221Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7689320Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7689899Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7689939Z graph_break [] 2025-12-04T09:58:54.7690014Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7690054Z Autotune Choices Stats: 2025-12-04T09:58:54.7690795Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.7690941Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7691055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7691233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7691848Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7692454Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7693070Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7693672Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7694277Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7694880Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7695517Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7696160Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7696758Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7697382Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7697515Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.7697557Z Autotune Choices Stats: 2025-12-04T09:58:54.7698321Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.7698546Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7698711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7698992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7699623Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7700294Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7700924Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7701556Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7702184Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7702817Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7703440Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7704099Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7704722Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7705351Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7705481Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.7705564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7705611Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7705648Z unimplemented [] 2025-12-04T09:58:54.7705712Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7705811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7706437Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7706477Z graph_break [] 2025-12-04T09:58:54.7706550Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7706592Z Autotune Choices Stats: 2025-12-04T09:58:54.7707334Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.7707464Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7707580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7707770Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7708394Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7708996Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.7709601Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7710215Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7710814Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7711414Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7712014Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7712646Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7713243Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7713844Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7713974Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.7714014Z Autotune Choices Stats: 2025-12-04T09:58:54.7714789Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.7715007Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7715172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7715458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7716204Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7716874Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7717503Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7718131Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7718770Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7719396Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7720027Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7720650Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7721310Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7721933Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7722062Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.7722138Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7722182Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7722221Z unimplemented [] 2025-12-04T09:58:54.7722285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7722384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7722970Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7723010Z graph_break [] 2025-12-04T09:58:54.7723083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7723126Z Autotune Choices Stats: 2025-12-04T09:58:54.7723865Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.7723995Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7724109Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7724269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7724881Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7725525Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7726164Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7726775Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7727398Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7728005Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7728615Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7729214Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7729849Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7730450Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7730581Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.7730624Z Autotune Choices Stats: 2025-12-04T09:58:54.7731398Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.7731615Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7731779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7732056Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7732685Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7733312Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7733965Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7734602Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7735236Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7735882Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7736565Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7737196Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7737822Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7738471Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7738601Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.7738679Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7738724Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7738761Z unimplemented [] 2025-12-04T09:58:54.7738823Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7738922Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7739510Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7739549Z graph_break [] 2025-12-04T09:58:54.7739623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7739679Z Autotune Choices Stats: 2025-12-04T09:58:54.7740416Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.7740545Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7740663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7740827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7741440Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7742046Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7742679Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7743280Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7743896Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7744516Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7745121Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7745724Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7746375Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7747016Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7747145Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.7747189Z Autotune Choices Stats: 2025-12-04T09:58:54.7747944Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.7748162Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7748348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7748624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7756041Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7756701Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7757327Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7758017Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7758643Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7759297Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7759942Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7760575Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7761204Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7761823Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7761974Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.7762055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7762100Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7762149Z unimplemented [] 2025-12-04T09:58:54.7762211Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7762316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7762897Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7762938Z graph_break [] 2025-12-04T09:58:54.7763014Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7763056Z Autotune Choices Stats: 2025-12-04T09:58:54.7763806Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.7763936Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7764058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7764221Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7764837Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7765441Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7766079Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7766720Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7767324Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7767938Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7768541Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7769148Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7769748Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7770352Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7770502Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.7770545Z Autotune Choices Stats: 2025-12-04T09:58:54.7771313Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.7771534Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7771701Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7771982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.7772626Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7773250Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7773870Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7774498Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7775154Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7775777Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7776447Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7777084Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7777707Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7778335Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7778464Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.7778541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7778607Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7778646Z unimplemented [] 2025-12-04T09:58:54.7778707Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7778811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7779396Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7779435Z graph_break [] 2025-12-04T09:58:54.7779510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7779551Z Autotune Choices Stats: 2025-12-04T09:58:54.7780294Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.7780420Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7780546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7780710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7781320Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7781919Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7782525Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7783118Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7783751Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7784351Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7784964Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7785565Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7786212Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7786816Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7786944Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.7786987Z Autotune Choices Stats: 2025-12-04T09:58:54.7787771Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.7788018Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7788191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7788469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7789111Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7789739Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7790361Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7790982Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7791607Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7792261Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7792879Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7793510Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7794147Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7794771Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7794903Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.7794977Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7795020Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7795059Z unimplemented [] 2025-12-04T09:58:54.7795120Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7795220Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7795796Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7795849Z graph_break [] 2025-12-04T09:58:54.7795972Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7796013Z Autotune Choices Stats: 2025-12-04T09:58:54.7796781Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.7796911Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.7797024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7797190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7797810Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7798420Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7799016Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7799618Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7800218Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7800864Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7801465Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7802070Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7802677Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7803282Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7803413Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.7803452Z Autotune Choices Stats: 2025-12-04T09:58:54.7804202Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.7804444Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7804619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7804901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7805532Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7806209Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7806829Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7807459Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7808085Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7808709Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7809362Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.7809993Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7810621Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7811241Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7811371Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.7811444Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.7811489Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7811528Z unimplemented [] 2025-12-04T09:58:54.7811591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7811690Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7812264Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7812301Z graph_break [] 2025-12-04T09:58:54.7812374Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7812416Z Autotune Choices Stats: 2025-12-04T09:58:54.7813182Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.7813309Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7813424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7813590Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7814211Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7814827Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7815428Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7816064Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7816659Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7817257Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7817902Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7818509Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7819130Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7819731Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7819860Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.7819899Z Autotune Choices Stats: 2025-12-04T09:58:54.7820660Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.7820879Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7821045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7821348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7821983Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7822605Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7823246Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7823869Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7824492Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7825123Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7825744Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7826438Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7827065Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7827701Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7827833Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.7827907Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7827952Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.7827990Z unimplemented [] 2025-12-04T09:58:54.7828053Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7828152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7828732Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7828770Z graph_break [] 2025-12-04T09:58:54.7828843Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7828886Z Autotune Choices Stats: 2025-12-04T09:58:54.7829631Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.7829781Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7829895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7830066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7830678Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7831280Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7831901Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7832503Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7833100Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7833714Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7834337Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7834957Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7835556Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7836211Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7836344Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.7836384Z Autotune Choices Stats: 2025-12-04T09:58:54.7837135Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.7837354Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7837519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7837797Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7838432Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7839086Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7839709Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7840350Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7840982Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7841606Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7842234Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7842859Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7843511Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7844138Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7844268Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.7844343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7844395Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7844442Z unimplemented [] 
2025-12-04T09:58:54.7844503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7844602Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7845176Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7845213Z graph_break [] 2025-12-04T09:58:54.7845285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7845324Z Autotune Choices Stats: 2025-12-04T09:58:54.7846092Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.7846218Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7846333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7846519Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7847139Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7847742Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7848342Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7848970Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7849577Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7850179Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7850778Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7851419Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7852018Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7852619Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7852749Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.7852788Z Autotune Choices Stats: 2025-12-04T09:58:54.7853554Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.7853772Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7853936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7854214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7854850Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7855471Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7856158Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7856783Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7857428Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7858059Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7858681Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7859311Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7859968Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7860587Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7860715Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.7860791Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7860833Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7860871Z unimplemented [] 2025-12-04T09:58:54.7860932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.7861031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7861617Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7861657Z graph_break [] 2025-12-04T09:58:54.7861731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7861771Z Autotune Choices Stats: 2025-12-04T09:58:54.7862511Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.7862639Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7862753Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7862915Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7863525Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7864152Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7864753Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7865358Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7866014Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7866616Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7867221Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7867830Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7868474Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7869074Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7869204Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.7869245Z Autotune Choices Stats: 
2025-12-04T09:58:54.7870014Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.7870233Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7870399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7870676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7871307Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7871934Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7872583Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7873207Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7873837Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7874473Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7875093Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7875724Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7876389Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7877051Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7877178Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.7877254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7877296Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7877335Z unimplemented [] 2025-12-04T09:58:54.7877395Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7877494Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7878069Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7878106Z graph_break [] 2025-12-04T09:58:54.7878178Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7878218Z Autotune Choices Stats: 2025-12-04T09:58:54.7878968Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.7879095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7879211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7879374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.7879988Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7880588Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7881218Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7881818Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7882428Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7883050Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7883651Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7884263Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7884871Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7885504Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7885630Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.7885674Z Autotune Choices Stats: 2025-12-04T09:58:54.7886470Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7886688Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7886870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7887150Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7887780Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.7888402Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7889025Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7889692Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7890321Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7890948Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7891584Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7892205Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7892835Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7893464Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7893615Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.7893690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7893731Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7893768Z unimplemented [] 2025-12-04T09:58:54.7893841Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7893943Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7894522Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7894559Z graph_break [] 2025-12-04T09:58:54.7894635Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7894676Z Autotune Choices Stats: 2025-12-04T09:58:54.7895430Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.7895561Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7895677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7895840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7896487Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7898499Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7899116Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7899792Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7900394Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7901001Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7901611Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7902224Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7902838Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7903532Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7903690Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.7903736Z Autotune Choices Stats: 2025-12-04T09:58:54.7904510Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.7904736Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7904904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7905186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7905824Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7906501Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7907140Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7907802Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7908481Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7909128Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7909761Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7910392Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7911028Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7911655Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7911806Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.7911892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7911938Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7912010Z unimplemented [] 2025-12-04T09:58:54.7912076Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7912189Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7912789Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7912835Z graph_break [] 2025-12-04T09:58:54.7912913Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7912962Z Autotune Choices Stats: 2025-12-04T09:58:54.7913697Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.7913832Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7913955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7914118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7914744Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7915357Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7916015Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7916644Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7917311Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7917920Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7918527Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7919137Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7919735Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7920334Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7920481Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.7920527Z Autotune Choices Stats: 2025-12-04T09:58:54.7921301Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.7921547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7921716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7922004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7922641Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7923267Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7923898Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7924529Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7925174Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7925843Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7926519Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7927150Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7927788Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7928420Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7928551Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.7928633Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7928678Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7928723Z unimplemented [] 2025-12-04T09:58:54.7928787Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7928894Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7929485Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7929568Z graph_break [] 2025-12-04T09:58:54.7929645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7929692Z Autotune Choices Stats: 2025-12-04T09:58:54.7930459Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.7930591Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7930717Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7930880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7931498Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7932120Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7932723Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7933352Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7933958Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7934621Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7935226Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7935836Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7936482Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7937095Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7937226Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.7937276Z Autotune Choices Stats: 2025-12-04T09:58:54.7938046Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7938303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7938488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7938768Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7939405Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7940038Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7940674Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7941298Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7941946Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7942572Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7943229Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7943864Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.7944491Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7945121Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7945251Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.7945330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7945374Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7945419Z unimplemented [] 2025-12-04T09:58:54.7945482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7945586Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7946337Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7946379Z graph_break [] 2025-12-04T09:58:54.7946458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7946499Z Autotune Choices Stats: 2025-12-04T09:58:54.7947258Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.7947420Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7947538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7947699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7948318Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7948925Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7949539Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7950139Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.7950760Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7951371Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7952008Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7952641Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7953266Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7953901Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7954033Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.7954080Z Autotune Choices Stats: 2025-12-04T09:58:54.7954843Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.7955075Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7955245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7955551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7956231Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7956860Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7957489Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7958120Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7958753Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7959407Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7960037Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7960709Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7961343Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7961976Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7962112Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.7962194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7962244Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7962284Z unimplemented [] 2025-12-04T09:58:54.7962348Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7962454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7963032Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7963074Z graph_break [] 2025-12-04T09:58:54.7963156Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7963200Z Autotune Choices Stats: 2025-12-04T09:58:54.7963967Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.7964127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7964241Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7964423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7965034Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7965642Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7966289Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7966904Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7967508Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7968143Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7968752Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7969396Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7970004Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7970613Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7970748Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.7970791Z Autotune Choices Stats: 2025-12-04T09:58:54.7971555Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.7971776Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7971948Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7972284Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7972913Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7973576Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7974206Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7974831Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7975463Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7976133Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7976787Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7977414Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7978086Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7978713Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7978849Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.7978926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7978974Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7979016Z unimplemented [] 2025-12-04T09:58:54.7979083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7979187Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7979761Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.7979802Z graph_break [] 2025-12-04T09:58:54.7979884Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7979927Z Autotune Choices Stats: 2025-12-04T09:58:54.7980675Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.7980822Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7980938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7981106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7981757Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7982363Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7982977Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7983585Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7984194Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7984869Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7985498Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7986181Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7986776Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7987389Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7987522Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.7987565Z Autotune Choices Stats: 2025-12-04T09:58:54.7988324Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.7988546Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7988712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7988991Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7989642Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7990265Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7990939Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7991568Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7992197Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7992829Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7993461Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.7994104Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7994763Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7995393Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7995530Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.7995607Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.7995655Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.7995696Z unimplemented [] 2025-12-04T09:58:54.7995765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.7995865Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.7996487Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.7996531Z graph_break [] 2025-12-04T09:58:54.7996607Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.7996654Z Autotune Choices Stats: 2025-12-04T09:58:54.7997397Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.7997532Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.7997648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.7997813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.7998456Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7999096Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.7999702Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8000313Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8000918Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8001526Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8002133Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8002757Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8003400Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8004007Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8004143Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.8004186Z Autotune Choices Stats: 2025-12-04T09:58:54.8004949Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.8005172Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8005339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8005617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8006289Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8006930Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8007601Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8008230Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8008862Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8009494Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8010123Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8010757Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8011404Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8012066Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8012204Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.8012280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8012329Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8012369Z unimplemented [] 2025-12-04T09:58:54.8012438Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8012556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8013126Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8013169Z graph_break [] 
2025-12-04T09:58:54.8013246Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8013292Z Autotune Choices Stats: 2025-12-04T09:58:54.8014047Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.8014176Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8014298Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8014459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8015089Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8015709Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8016385Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8016994Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8017604Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8018211Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8018818Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8019426Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8020049Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8020693Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8020824Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.8020872Z Autotune Choices Stats: 2025-12-04T09:58:54.8021634Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.8021859Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8022029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8022311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8022942Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8023576Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.8024221Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8024884Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8025516Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8026188Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8026814Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8027447Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8028077Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8028727Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8028893Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.8028976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8029021Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8029066Z unimplemented [] 2025-12-04T09:58:54.8029142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8029248Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8029830Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8029871Z graph_break [] 2025-12-04T09:58:54.8029952Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.8029995Z Autotune Choices Stats: 2025-12-04T09:58:54.8030738Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.8030868Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8030990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8031155Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8031773Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8032388Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8033005Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8033647Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8034254Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8034864Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8035472Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8036120Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8036726Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8037347Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8037501Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.8037547Z Autotune Choices Stats: 2025-12-04T09:58:54.8038324Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.8038544Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8038715Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8038994Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8039630Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8040264Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8040900Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8041538Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8042212Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8042848Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8043482Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8044111Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8044750Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8045390Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8045536Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.8045612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8045662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8045730Z unimplemented [] 2025-12-04T09:58:54.8045797Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8045899Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8046530Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8046570Z graph_break [] 2025-12-04T09:58:54.8046650Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8046693Z Autotune Choices Stats: 
2025-12-04T09:58:54.8047443Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.8047580Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8047697Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8047862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8048479Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8049090Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8049714Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8050323Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8050968Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8051575Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8052190Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8052796Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8053407Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8054015Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8054160Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.8054203Z Autotune Choices Stats: 2025-12-04T09:58:54.8054969Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.8055208Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8055378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8055659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8056334Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8056963Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8057590Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8058223Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8058875Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8059546Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8060170Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8060811Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8061442Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8062076Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8062214Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.8062290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8062339Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8062379Z unimplemented [] 2025-12-04T09:58:54.8062448Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8062567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8063148Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8063212Z graph_break [] 2025-12-04T09:58:54.8063292Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8063334Z Autotune Choices Stats: 2025-12-04T09:58:54.8064088Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.8064222Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8064337Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8064505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8065118Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8065731Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8066369Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8067002Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8067605Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8068251Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8068866Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8069478Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8070082Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8070692Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8070828Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.8070873Z Autotune Choices Stats: 2025-12-04T09:58:54.8071652Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.8071894Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.8072077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8072360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8072991Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8073625Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8074255Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8074884Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8075533Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8076206Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8076880Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8077514Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8078149Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8078780Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8078915Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.8078992Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8079040Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8079080Z unimplemented [] 2025-12-04T09:58:54.8079149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8079251Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8079845Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8079889Z graph_break [] 2025-12-04T09:58:54.8079965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8080013Z Autotune Choices Stats: 2025-12-04T09:58:54.8080793Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.8080924Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8081040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8081208Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8081825Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8082436Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8083049Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8083658Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8084290Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8084895Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8085532Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8086170Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8086781Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8087387Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8087523Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.8087567Z Autotune Choices Stats: 2025-12-04T09:58:54.8088332Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.8088574Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8088741Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8089048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8089700Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8090328Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8090959Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8091593Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8092220Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8092869Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8093499Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8094172Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8094804Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8095433Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8095568Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.8095649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8095694Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8095738Z unimplemented [] 2025-12-04T09:58:54.8095801Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8095906Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8096513Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8096561Z graph_break [] 2025-12-04T09:58:54.8096636Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8096683Z Autotune Choices Stats: 2025-12-04T09:58:54.8097442Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.8097604Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8097726Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8097902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8098517Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8099130Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8099737Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8100353Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8100964Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8101575Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8102223Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8102826Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8103436Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8104045Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8104181Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.8104229Z Autotune Choices Stats: 2025-12-04T09:58:54.8104986Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8105210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8105381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8105672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8106377Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8107050Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8107675Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8108305Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8108947Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8109580Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8110216Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8110885Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8111518Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8112157Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8112287Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.8112368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8112415Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8112461Z unimplemented [] 2025-12-04T09:58:54.8112524Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8112630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8113211Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8113252Z graph_break [] 2025-12-04T09:58:54.8113328Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8113374Z Autotune Choices Stats: 2025-12-04T09:58:54.8114118Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.8114258Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8114379Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8114560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8115189Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8115798Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.8116442Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8117051Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8117658Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8118271Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8118891Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8119538Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8120145Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8120762Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8120893Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.8120941Z Autotune Choices Stats: 2025-12-04T09:58:54.8121695Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8121917Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8122087Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8122369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8123022Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8123687Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8124312Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8124943Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8125580Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8126250Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8126881Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8127541Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8128215Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8128846Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8128978Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.8129059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8129106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8129151Z unimplemented [] 2025-12-04T09:58:54.8129213Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8129320Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8129900Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8129947Z graph_break [] 2025-12-04T09:58:54.8130027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8130070Z Autotune Choices Stats: 2025-12-04T09:58:54.8130820Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.8130950Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8131070Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8131233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8131853Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8132494Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8133104Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8133711Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8134322Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8134935Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8135551Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8136206Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8136844Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8137457Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8137588Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.8137634Z Autotune Choices Stats: 2025-12-04T09:58:54.8138397Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.8138615Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8138786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8139066Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8139705Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8140347Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8141018Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8141643Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8142287Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8142920Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8143554Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8144194Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8144838Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8145502Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8145638Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.8145715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8145768Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8145808Z unimplemented [] 2025-12-04T09:58:54.8145874Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8146014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8146599Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8146639Z graph_break [] 2025-12-04T09:58:54.8146720Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8146764Z Autotune Choices Stats: 2025-12-04T09:58:54.8147517Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.8147652Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8147766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8147932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8148567Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8149175Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8149823Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8150431Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8151041Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8151647Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8152267Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8152875Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8153494Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8154127Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8154263Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.8154306Z Autotune Choices Stats: 2025-12-04T09:58:54.8155066Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8155290Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8155461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8155742Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8156409Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8157042Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8157691Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8158357Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8158986Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8159620Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8160256Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8160892Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8161521Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8162164Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8162320Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.8162397Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8162455Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8162495Z unimplemented [] 2025-12-04T09:58:54.8162564Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8162667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8163252Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8163296Z graph_break [] 2025-12-04T09:58:54.8163372Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8163419Z Autotune Choices Stats: 2025-12-04T09:58:54.8164162Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.8164297Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8164414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8164582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8165193Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8165805Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8166446Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8167109Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8167719Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8168323Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8168936Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8169547Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8170167Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8170774Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8170932Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.8170974Z Autotune Choices Stats: 2025-12-04T09:58:54.8171751Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.8171975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8172144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8172429Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8173066Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8173696Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8174325Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8174962Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8175638Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8176293Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8176917Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8177557Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8178187Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8178840Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8178974Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.8179051Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8179126Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8179166Z unimplemented [] 2025-12-04T09:58:54.8179234Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8179336Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8179937Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8179979Z graph_break [] 2025-12-04T09:58:54.8180055Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8180105Z Autotune Choices Stats: 2025-12-04T09:58:54.8180849Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:54.8180982Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8181103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8181265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8181883Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8182488Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8183104Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8183709Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8184354Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8184958Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8185566Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8186210Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8186814Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8187433Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8187568Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:54.8187640Z Autotune Choices Stats: 2025-12-04T09:58:54.8188415Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.8188641Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8188808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8189089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:54.8189727Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8190353Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8190985Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8191632Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8192265Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8192928Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8193564Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8194197Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8194827Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8195461Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8195596Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:54.8195677Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8195721Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8195766Z unimplemented [] 2025-12-04T09:58:54.8195841Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8195991Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8196569Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8196641Z graph_break [] 2025-12-04T09:58:54.8196716Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8196775Z Autotune Choices Stats: 2025-12-04T09:58:54.8197522Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:54.8197657Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8197778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8197942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8198556Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8199164Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8199770Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8200391Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8201021Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8201643Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8202252Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8202859Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8203469Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8204080Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8204211Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:54.8204259Z Autotune Choices Stats: 2025-12-04T09:58:54.8205024Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8205267Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8205453Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8205729Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8206405Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8207040Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8207667Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8208298Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8208968Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8209601Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8210281Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8210917Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8211558Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8212190Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8212321Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:54.8212406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8212451Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8212494Z unimplemented [] 2025-12-04T09:58:54.8212559Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8212665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8213260Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8213304Z graph_break [] 2025-12-04T09:58:54.8213382Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8213449Z Autotune Choices Stats: 2025-12-04T09:58:54.8214212Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:54.8214340Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8214459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8214623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8215240Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8215853Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8216501Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8217104Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8217740Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8218386Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8218992Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8219602Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8220219Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8220828Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8220958Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:54.8221007Z Autotune Choices Stats: 2025-12-04T09:58:54.8221783Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.8222005Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8222201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8222483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8223134Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8223765Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8224393Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8225027Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8225660Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8226356Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8227032Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8227661Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8228295Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8228922Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8229059Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:54.8229162Z 
___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:54.8229212Z Traceback (most recent call last):
2025-12-04T09:58:54.8229374Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:54.8229418Z self.assertTrue(
2025-12-04T09:58:54.8229534Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:54.8229586Z raise self.failureException(msg)
2025-12-04T09:58:54.8229718Z AssertionError: False is not true : Log file /tmp/tmpis9kuz2a/flex_attention_configs.json was not created
2025-12-04T09:58:54.8229722Z
2025-12-04T09:58:54.8229804Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:54.8229971Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:54.8229974Z
2025-12-04T09:58:54.8230071Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:54.8230150Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:54.8230210Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:54.8230250Z unimplemented []
2025-12-04T09:58:54.8230316Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:54.8230903Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:54.8231030Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2),
('ok', 2)] 2025-12-04T09:58:54.8231069Z graph_break [] 2025-12-04T09:58:54.8231161Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8231655Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.8231711Z current_size = base.storage().size() 2025-12-04T09:58:54.8231758Z Autotune Choices Stats: 2025-12-04T09:58:54.8232507Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.8232642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8232761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8232929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8233540Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8234149Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8234768Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8235392Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8236051Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8236662Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8237264Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8237876Z triton_flex_attention_10 0.0168 ms 71.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8238473Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8239094Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8239230Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.8239311Z Autotune Choices Stats: 2025-12-04T09:58:54.8240086Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.8240308Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8240478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8240757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8241392Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8242023Z 
triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8242648Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8243294Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8243928Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8244586Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8245210Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8245838Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8246508Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8247128Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8247263Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.8247340Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8247389Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8247427Z unimplemented [] 2025-12-04T09:58:54.8247519Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8247621Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8248200Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8248266Z graph_break [] 2025-12-04T09:58:54.8248355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8248402Z Autotune Choices Stats: 2025-12-04T09:58:54.8249137Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.8249267Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8249383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8249547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8250160Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8250767Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8251368Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8251983Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8252614Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8253216Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8253813Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8254416Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8255022Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8255622Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8255758Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.8255801Z Autotune Choices Stats: 2025-12-04T09:58:54.8256619Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.8256866Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8257045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8257324Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8257960Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8258583Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8259211Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8259834Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8260474Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8261101Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8261759Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8262388Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8263016Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8263643Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8263773Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.8263855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8263898Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8263938Z unimplemented [] 2025-12-04T09:58:54.8263999Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8264100Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8264684Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8264725Z graph_break [] 2025-12-04T09:58:54.8264801Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8264863Z Autotune Choices Stats: 2025-12-04T09:58:54.8265615Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.8269575Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8269698Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8269866Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8270492Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8271099Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8271712Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8272315Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8272967Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8273621Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8274228Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8274838Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8275438Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8276099Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8276229Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.8276273Z Autotune Choices Stats: 2025-12-04T09:58:54.8277049Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.8277272Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8277463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8277747Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8278399Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8279028Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8279654Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8280288Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8280918Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8281560Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8282217Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8282852Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8283482Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8284106Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8284236Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.8284314Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8284356Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8284394Z unimplemented [] 2025-12-04T09:58:54.8284456Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8284557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8285143Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8285181Z graph_break [] 2025-12-04T09:58:54.8285255Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8285296Z Autotune Choices Stats: 2025-12-04T09:58:54.8286093Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.8286250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8286380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8286543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8287166Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8287789Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8288397Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8289014Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.8289627Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8290245Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8290885Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8291490Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8292096Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8292706Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8292836Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.8292876Z Autotune Choices Stats: 2025-12-04T09:58:54.8293636Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.8293859Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8294035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8294313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8294984Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8295614Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8296279Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8296910Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8297550Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8298182Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8298827Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8299510Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8300139Z triton_flex_attention_backward_174 0.0227 ms 69.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8300771Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8300899Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.8300975Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8301017Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8301055Z unimplemented [] 2025-12-04T09:58:54.8301117Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8301218Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8301806Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8301844Z graph_break [] 2025-12-04T09:58:54.8301918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8301959Z Autotune Choices Stats: 2025-12-04T09:58:54.8302719Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.8302847Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8302964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8303147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8303778Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8304385Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8304995Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8305598Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8306246Z triton_flex_attention_187 0.0128 ms 72.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8306857Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8307479Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8308134Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8308744Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8309354Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8309483Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.8309525Z Autotune Choices Stats: 2025-12-04T09:58:54.8310285Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.8310506Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8310671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8310955Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8311603Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8312261Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8312892Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8313525Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8314149Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8314784Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8315418Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8316111Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8316777Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8317402Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8317533Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.8317608Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8317650Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8317687Z unimplemented [] 2025-12-04T09:58:54.8317749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8317849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8318426Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8318464Z graph_break [] 2025-12-04T09:58:54.8318538Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8318578Z Autotune Choices Stats: 2025-12-04T09:58:54.8319326Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.8319454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8319570Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8319755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8320562Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8321208Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8321840Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8322472Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8323108Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8323714Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8324320Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8324945Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8325577Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8326216Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8326350Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.8326389Z Autotune Choices Stats: 2025-12-04T09:58:54.8327158Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.8327380Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8327545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8327825Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8328463Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8329119Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8329785Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8330416Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8331049Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8331687Z 
triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8332316Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8332948Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8333593Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8334255Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8334385Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.8334459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8334504Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8334540Z unimplemented [] 2025-12-04T09:58:54.8334601Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8334700Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8335282Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8335318Z graph_break [] 2025-12-04T09:58:54.8335393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8335434Z Autotune Choices Stats: 2025-12-04T09:58:54.8336212Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.8336342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8336455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8336620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8337250Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8337863Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8338509Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8339106Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8339715Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8340329Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8340936Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8341544Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8342160Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8342809Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8342938Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.8342979Z Autotune Choices Stats: 2025-12-04T09:58:54.8343745Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.8343963Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8344126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8344413Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8345051Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8345688Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8346379Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8347069Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8347696Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8348328Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8348957Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8349593Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8350236Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8350864Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8351013Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.8351087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8351140Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8351177Z unimplemented [] 2025-12-04T09:58:54.8351239Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8351339Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8351913Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8351950Z graph_break [] 
2025-12-04T09:58:54.8352024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8352064Z Autotune Choices Stats: 2025-12-04T09:58:54.8352813Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.8352942Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8353055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8353221Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8353846Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8354465Z triton_flex_attention_328 0.0120 ms 87.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8355069Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8355715Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8356355Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8356963Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8357571Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8358187Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8358819Z triton_flex_attention_334 
0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8359424Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8359579Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.8359619Z Autotune Choices Stats: 2025-12-04T09:58:54.8360403Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 
2025-12-04T09:58:54.8360626Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8360791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8361071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8361710Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8362359Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8363005Z 
triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8363634Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8364301Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8364927Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8365558Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8366230Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8366858Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8367509Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8367638Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.8367712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8367782Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8367818Z unimplemented [] 2025-12-04T09:58:54.8367880Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8367980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8368581Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8368620Z graph_break [] 2025-12-04T09:58:54.8368692Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:54.8368736Z Autotune Choices Stats: 2025-12-04T09:58:54.8369491Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.8369619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8369731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8369895Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8370510Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8371125Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8371744Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8372347Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8372994Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8373602Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8374206Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8374811Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8375424Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8376075Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8376204Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.8376268Z Autotune Choices Stats: 2025-12-04T09:58:54.8377043Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.8377263Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8377428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8377711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8378346Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8378973Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8379603Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8380241Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8380873Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8381531Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8382180Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8382807Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8383438Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8384072Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8384202Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:54.8384276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8384319Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8384356Z unimplemented [] 2025-12-04T09:58:54.8384417Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8384527Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8385099Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8385158Z graph_break [] 2025-12-04T09:58:54.8385231Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8385286Z Autotune Choices Stats: 2025-12-04T09:58:54.8386074Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.8386203Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8386320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8386482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8387098Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8387705Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8388307Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8388946Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8389578Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.8390199Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8390801Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8391406Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8392015Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8392620Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8392751Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.8392792Z Autotune Choices Stats: 2025-12-04T09:58:54.8393567Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.8393806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8393983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8394263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8394899Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8395531Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8396188Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8396816Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8397465Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8398093Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8398753Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8399385Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8400022Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8400656Z triton_flex_attention_backward_450 
0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8400784Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.8400858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8400902Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8400940Z unimplemented [] 2025-12-04T09:58:54.8401000Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8401102Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8401695Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8401734Z graph_break [] 2025-12-04T09:58:54.8401809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8401874Z Autotune Choices Stats: 2025-12-04T09:58:54.8402641Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.8402768Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8402883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8403045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8403663Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8404274Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8404881Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8405490Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8406138Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8406791Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8407400Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8408008Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8408616Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8409229Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8409359Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.8409399Z Autotune Choices Stats: 2025-12-04T09:58:54.8410167Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.8410403Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8410566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], 
[256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8410868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8411516Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8412147Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8412785Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8413417Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8414049Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8414692Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8415337Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8416012Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8416650Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8417282Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8417412Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.8417487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8417529Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8417567Z unimplemented [] 2025-12-04T09:58:54.8417628Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8417730Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8418309Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8418348Z graph_break [] 2025-12-04T09:58:54.8418422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8418462Z Autotune Choices Stats: 2025-12-04T09:58:54.8419232Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.8419391Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8419505Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8419677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8420297Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8420913Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8421521Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8422127Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8422731Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8423355Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8423990Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8424596Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8425312Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8425954Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8426085Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.8426126Z Autotune Choices Stats: 2025-12-04T09:58:54.8426897Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.8427118Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8427284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8427592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8428256Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8428899Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8429528Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8430159Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8430790Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8431419Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8432060Z triton_flex_attention_backward_546 0.0218 ms 73.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8432727Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8433367Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8433996Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8434126Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.8434201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8434246Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8434283Z unimplemented [] 2025-12-04T09:58:54.8434344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8434456Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8435045Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8435082Z graph_break [] 2025-12-04T09:58:54.8435156Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8435198Z Autotune Choices Stats: 2025-12-04T09:58:54.8436022Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.8436150Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8436262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8436464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8437094Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8437702Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8438309Z 
triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8438920Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8439529Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8440139Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8440758Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8441400Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8442009Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8442616Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8442746Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:54.8442785Z Autotune Choices Stats: 2025-12-04T09:58:54.8443556Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.8443778Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8443943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T09:58:54.8444225Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8444874Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8445538Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8446211Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8446843Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8447475Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8448112Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8448755Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8449400Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8450070Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8450704Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8450838Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.8450912Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8450956Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8451000Z unimplemented [] 2025-12-04T09:58:54.8451061Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8451161Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8451736Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8451775Z graph_break [] 2025-12-04T09:58:54.8451849Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8451888Z Autotune Choices Stats: 2025-12-04T09:58:54.8452635Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.8452766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8452879Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8453044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8453672Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8454309Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8454919Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8455522Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8456167Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8456775Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8457384Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8458022Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8458679Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8459284Z triton_flex_attention_616 0.0163 ms 67.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8459415Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.8459455Z Autotune Choices Stats: 2025-12-04T09:58:54.8460218Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.8460441Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8460604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8460887Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8461523Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8462168Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8462826Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8463454Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8464092Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8464723Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8465352Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8466018Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8466662Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8467334Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8467462Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.8467536Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8467580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8467617Z unimplemented [] 2025-12-04T09:58:54.8467677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8467776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8468359Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8468397Z graph_break [] 2025-12-04T09:58:54.8468470Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8468513Z Autotune Choices Stats: 2025-12-04T09:58:54.8469278Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.8469408Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8469521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8469685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8470312Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8470916Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8471549Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8472150Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8472751Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8473356Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8473963Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8474564Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8475176Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8475807Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8475971Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.8476011Z Autotune Choices Stats: 2025-12-04T09:58:54.8476773Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.8476989Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8477153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8477431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8478064Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8478689Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8479330Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8479994Z triton_flex_attention_backward_676 0.0188 ms 82.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8480616Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8481237Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8481862Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8482490Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8483116Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8483747Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8483906Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.8483981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8484033Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8484070Z unimplemented [] 2025-12-04T09:58:54.8484132Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8484232Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8484808Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8484848Z graph_break [] 2025-12-04T09:58:54.8484922Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8484964Z Autotune Choices Stats: 2025-12-04T09:58:54.8485707Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.8485836Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8485984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8486147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8486764Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8487381Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8487983Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8488627Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8489227Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8489834Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8490438Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8491050Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8491662Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8492261Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8492418Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.8492457Z Autotune Choices Stats: 2025-12-04T09:58:54.8493230Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.8493447Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8493611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8493887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.8494515Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8495144Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8495768Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8496439Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8497103Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8497731Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8498356Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8498988Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8499624Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8500259Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8500388Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.8500463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8500542Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8500582Z unimplemented [] 2025-12-04T09:58:54.8500643Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8500745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8501330Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8501368Z graph_break [] 2025-12-04T09:58:54.8501440Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8501485Z Autotune Choices Stats: 2025-12-04T09:58:54.8502229Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.8502356Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8502470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8502633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8503244Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8503847Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8504462Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8505064Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8505698Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8506350Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8506951Z triton_flex_attention_756 0.0143 ms 72.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8507552Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8508155Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8508776Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8508905Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.8508945Z Autotune Choices Stats: 2025-12-04T09:58:54.8509741Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.8509959Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8510125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8510405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8511044Z triton_flex_attention_backward_777 0.0157 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8511674Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8512295Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8512934Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8513560Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8514215Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8514839Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8515470Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8516138Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8516765Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8516894Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 
choices 2025-12-04T09:58:54.8516969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8517010Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8517050Z unimplemented [] 2025-12-04T09:58:54.8517109Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8517230Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8517803Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8517870Z graph_break [] 2025-12-04T09:58:54.8517942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8517982Z Autotune Choices Stats: 2025-12-04T09:58:54.8518739Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.8518865Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.8518979Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8519139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8519752Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8520355Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8520957Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8521576Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8522173Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8522809Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8523416Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8524021Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8524620Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8525223Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8525353Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.8525394Z Autotune Choices Stats: 2025-12-04T09:58:54.8526194Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.8526441Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8526639Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8526916Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8527561Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8528193Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8528819Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8529446Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8530094Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8530718Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8531378Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8532018Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8532647Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8533275Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8533408Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.8533490Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.8533537Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8533582Z unimplemented [] 2025-12-04T09:58:54.8533645Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8533749Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8534340Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8534385Z graph_break [] 2025-12-04T09:58:54.8534465Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8534529Z Autotune Choices Stats: 2025-12-04T09:58:54.8535281Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.8535411Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8535532Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8535695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8536344Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8536953Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8537565Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8538166Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8538800Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8539432Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8540050Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8540653Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8541262Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8541871Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8542002Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.8542047Z Autotune Choices Stats: 2025-12-04T09:58:54.8542809Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.8543035Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8543208Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8543503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8544147Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8544777Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8545408Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8546070Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8546703Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8547356Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8547984Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8548652Z triton_flex_attention_backward_873 0.0221 ms 70.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8549285Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8549917Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8550055Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.8550131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8550182Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.8550221Z unimplemented [] 2025-12-04T09:58:54.8550285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8550390Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8550969Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8551010Z graph_break [] 2025-12-04T09:58:54.8551090Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8551132Z Autotune Choices Stats: 2025-12-04T09:58:54.8551880Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.8552030Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8552146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8552323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8552935Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8553544Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8554151Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8554765Z triton_flex_attention_879 0.0113 ms 85.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8555370Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8556032Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8556681Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8557286Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8557896Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8558501Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8558639Z SingleProcess AUTOTUNE 
benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.8558682Z Autotune Choices Stats: 2025-12-04T09:58:54.8559445Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.8559663Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8559834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8560127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8560759Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8561416Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8562042Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8562664Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8563294Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8563921Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8564555Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8565222Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8565847Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8566500Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8566631Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.8566705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8566752Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8566790Z unimplemented [] 
2025-12-04T09:58:54.8566856Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8566959Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8567545Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8567585Z graph_break [] 2025-12-04T09:58:54.8567663Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8567705Z Autotune Choices Stats: 2025-12-04T09:58:54.8568442Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.8568591Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8568707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8568902Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8569533Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8570139Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8570751Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8571352Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8571959Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8572565Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8573182Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8573819Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8574421Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8575029Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8575164Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.8575205Z Autotune Choices Stats: 2025-12-04T09:58:54.8576010Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.8576237Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8576405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8576690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8577338Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8578006Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8578635Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8579264Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8579890Z 
triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8580528Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8581157Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8581795Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8582450Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8583081Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8583215Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.8583291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8583336Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8583375Z unimplemented [] 2025-12-04T09:58:54.8583438Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.8583538Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8584116Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8584159Z graph_break [] 2025-12-04T09:58:54.8584233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8584277Z Autotune Choices Stats: 2025-12-04T09:58:54.8585023Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.8585154Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8585267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8585431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8586102Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8586742Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8587349Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8587953Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8588556Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8589162Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8589772Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8590389Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8591017Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8591624Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8591757Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.8591798Z Autotune Choices Stats: 
2025-12-04T09:58:54.8592562Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.8592784Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8592950Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8593231Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8593872Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8594511Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8595171Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8595795Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8596470Z triton_flex_attention_backward_1008 0.0202 ms 77.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8597096Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8597717Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8598351Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8598997Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8599658Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8599789Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.8599864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8599910Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8599949Z unimplemented [] 2025-12-04T09:58:54.8600014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8600113Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8600698Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8600737Z graph_break [] 2025-12-04T09:58:54.8600811Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8600858Z Autotune Choices Stats: 2025-12-04T09:58:54.8601605Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.8601735Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8601859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8602021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.8602639Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8603259Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8603895Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8604501Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8605112Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8605716Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8606379Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8606984Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8607608Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8608247Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8608379Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.8608422Z Autotune Choices Stats: 2025-12-04T09:58:54.8609184Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.8609403Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8609569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8609848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8610483Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8611112Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8611753Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8612423Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8613049Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8613681Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8614309Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8614942Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8615567Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8616248Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8616404Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.8616481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8616524Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8616580Z unimplemented [] 2025-12-04T09:58:54.8616642Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8616746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8617322Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8617362Z graph_break [] 2025-12-04T09:58:54.8617435Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8617477Z Autotune Choices Stats: 2025-12-04T09:58:54.8618211Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.8618341Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8618456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8618618Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8619227Z 
triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8619845Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8620449Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8621087Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8621689Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8622296Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8622902Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.8623507Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8624111Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8624725Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8624874Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.8624919Z Autotune Choices Stats: 2025-12-04T09:58:54.8625693Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.8625914Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8626113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8626391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8627028Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8627669Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8628299Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8628944Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8629622Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8630267Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8630896Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8631526Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8632152Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8632795Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8632924Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.8633001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8633062Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8633103Z unimplemented [] 2025-12-04T09:58:54.8633163Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8633265Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:54.8633852Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8633895Z graph_break [] 2025-12-04T09:58:54.8633969Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8634011Z Autotune Choices Stats: 2025-12-04T09:58:54.8634763Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.8634889Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8635007Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8635172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8635792Z triton_flex_attention_1110 0.0087 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8636428Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8637056Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8637674Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8638314Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8638920Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8639522Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8640127Z 
triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8640730Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8641343Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8641472Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.8641515Z Autotune Choices Stats: 2025-12-04T09:58:54.8642307Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.8642526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8642694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8642971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8643608Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.8644242Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8644867Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8645514Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8646168Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8646844Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8647468Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8648100Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8648735Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8649362Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8649490Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.8649566Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8649610Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8649650Z unimplemented [] 2025-12-04T09:58:54.8649711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8649827Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8650398Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8650455Z graph_break [] 2025-12-04T09:58:54.8650530Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8650572Z Autotune Choices Stats: 2025-12-04T09:58:54.8651331Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.8651461Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8651577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8651742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8652352Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8652957Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8653573Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8654192Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8654800Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8655444Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8656082Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8656685Z triton_flex_attention_1170 0.0139 ms 80.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8657292Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8657901Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8658037Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.8658080Z Autotune Choices Stats: 2025-12-04T09:58:54.8658864Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.8659107Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8659296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8659578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8660215Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8660846Z 
triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8661562Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8662187Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8662838Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8663471Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8664130Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8664766Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8665398Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8666069Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8666202Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.8666281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8666331Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8666371Z unimplemented [] 2025-12-04T09:58:54.8666439Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8666540Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8667148Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8667187Z graph_break [] 2025-12-04T09:58:54.8667267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8667346Z Autotune Choices Stats: 2025-12-04T09:58:54.8668111Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.8668243Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8668358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8668525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8669144Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8669755Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8670369Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8670981Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8671600Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8672241Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8672851Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8673466Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8674070Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8674683Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8674816Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.8674859Z Autotune Choices Stats: 2025-12-04T09:58:54.8675627Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.8675995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8676162Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8676475Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8677122Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8677754Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8678388Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8679019Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8679650Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8680295Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8680948Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8681581Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8682213Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8682841Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8682979Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.8683057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8683106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8683145Z unimplemented [] 2025-12-04T09:58:54.8683210Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8683312Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8683893Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8683933Z graph_break [] 2025-12-04T09:58:54.8684013Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8684056Z Autotune Choices Stats: 2025-12-04T09:58:54.8684815Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.8684968Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8685094Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8685263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8685884Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8686524Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8687131Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8687744Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8688350Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8688984Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8689634Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8690245Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8690853Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8691458Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8691594Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.8691636Z Autotune Choices Stats: 2025-12-04T09:58:54.8692400Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8692625Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8692809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8693092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8693768Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8694401Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8695034Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8695665Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8696334Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8696964Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8697616Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8698279Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8698915Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8699543Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8699677Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.8699756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8699805Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8699845Z unimplemented [] 2025-12-04T09:58:54.8699910Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8700011Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8700591Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8700636Z graph_break [] 2025-12-04T09:58:54.8700712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8700760Z Autotune Choices Stats: 2025-12-04T09:58:54.8701523Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.8701656Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8701770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8701956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8702585Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8703189Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8703802Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8704415Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8705031Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8705709Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8706370Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8707016Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8707624Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8708228Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8708363Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.8708407Z Autotune Choices Stats: 2025-12-04T09:58:54.8709173Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8709395Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8709564Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8709846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8710511Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8711171Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8711800Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8712435Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8713068Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8713697Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8714330Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8714975Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8715637Z 
triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8716391Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8716527Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.8716610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8716656Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8716701Z unimplemented [] 2025-12-04T09:58:54.8716766Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8716872Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8717455Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8717499Z graph_break [] 2025-12-04T09:58:54.8717575Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8717622Z Autotune Choices Stats: 2025-12-04T09:58:54.8718363Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.8718498Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8718620Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8718804Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8719426Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8720072Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8720674Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8721287Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8721901Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8722513Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8723117Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8723738Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8724381Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8724988Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8725119Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.8725168Z Autotune Choices Stats: 2025-12-04T09:58:54.8725974Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.8726199Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8726372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8726651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8727288Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8727949Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8728616Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8729245Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8729879Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8730517Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8731144Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8731775Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8732419Z 
triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8733078Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8733208Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.8733288Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8733332Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8733374Z unimplemented [] 2025-12-04T09:58:54.8733437Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8733542Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8734120Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8734164Z graph_break [] 2025-12-04T09:58:54.8734241Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8734286Z Autotune Choices Stats: 2025-12-04T09:58:54.8735029Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.8735156Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8735277Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8735439Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8736110Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8736710Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8737350Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8737957Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8738565Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8739172Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8739785Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8740406Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8741012Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8741656Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8741788Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.8741838Z Autotune Choices Stats: 2025-12-04T09:58:54.8742609Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8742833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8743004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8743278Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8743917Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8744559Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8745193Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8745851Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8746508Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8747144Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8747773Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8748412Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8749064Z 
triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8749713Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8749868Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.8749962Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8750007Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8750052Z unimplemented [] 2025-12-04T09:58:54.8750114Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8750219Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8750805Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8750846Z graph_break [] 2025-12-04T09:58:54.8750927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8750970Z Autotune Choices Stats: 2025-12-04T09:58:54.8751721Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.8751852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8751972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8752137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8752750Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8753374Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8753982Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8754617Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8755231Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8755837Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8756477Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8757075Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8757697Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8758306Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8758468Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.8758525Z Autotune Choices Stats: 2025-12-04T09:58:54.8759288Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.8759508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8759682Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8759964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8760598Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8761232Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8761877Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8762505Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8763173Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8763806Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8764442Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8765073Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8765708Z 
triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8766396Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8766531Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.8766639Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8766691Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8766731Z unimplemented [] 2025-12-04T09:58:54.8766798Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8766900Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8767498Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8767540Z graph_break [] 2025-12-04T09:58:54.8767624Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8767667Z Autotune Choices Stats: 2025-12-04T09:58:54.8768415Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:54.8768547Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8768667Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8768838Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8769460Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8770073Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8770697Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8771350Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8771961Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8772568Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8773180Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8773786Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8774388Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8775022Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8775159Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:54.8775224Z Autotune Choices Stats: 2025-12-04T09:58:54.8776046Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.8776264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8776436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8776715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8777352Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8777989Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8778620Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8779267Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8779902Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8780563Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8781192Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8781824Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8782456Z 
triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8783090Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8783225Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:54.8783303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8783351Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8783402Z unimplemented [] 2025-12-04T09:58:54.8786492Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8786603Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8787186Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8787274Z graph_break [] 2025-12-04T09:58:54.8787368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8787410Z Autotune Choices Stats: 2025-12-04T09:58:54.8788154Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 2025-12-04T09:58:54.8788285Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8788404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8788570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8789185Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8789803Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8790411Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8791029Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8791670Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8792269Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8792878Z triton_flex_attention_1540 0.0131 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8793481Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8794087Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8794687Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8794819Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:54.8794859Z Autotune Choices Stats: 2025-12-04T09:58:54.8795625Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8795864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8796077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8796357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8796992Z triton_flex_attention_backward_1559 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8797620Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8798250Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8798880Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8799522Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8800185Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8800816Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8801442Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8802067Z 
triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8802691Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8802823Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 seconds precompiling for 22 choices 2025-12-04T09:58:54.8802900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8802945Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8802983Z unimplemented [] 2025-12-04T09:58:54.8803048Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8803147Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8803734Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8803794Z graph_break [] 2025-12-04T09:58:54.8803867Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8803908Z Autotune Choices Stats: 2025-12-04T09:58:54.8804662Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:54.8804790Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8804908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8805072Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8805685Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8806322Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8806923Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8807529Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.8808156Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8808801Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8809405Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8810012Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8810617Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8811219Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8811348Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:54.8811389Z Autotune Choices Stats: 2025-12-04T09:58:54.8812166Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.8812385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8812569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8812856Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8813486Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8814114Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8814739Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8815365Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8816037Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8816685Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8817338Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8817965Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8818597Z 
triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8819219Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8819353Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:54.8819428Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8819472Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8819511Z unimplemented [] 2025-12-04T09:58:54.8819571Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8819670Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8820246Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8820284Z graph_break [] 2025-12-04T09:58:54.8820358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8820400Z Autotune Choices Stats: 2025-12-04T09:58:54.8821147Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:54.8821297Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8821420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8821581Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8822189Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8822793Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8823395Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8823993Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8824599Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8825222Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8825859Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8826496Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8827100Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8827707Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8827840Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:54.8827881Z Autotune Choices Stats: 2025-12-04T09:58:54.8828640Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.8828861Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8829049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8829323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8829993Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8830617Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8831241Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8831871Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8832499Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8833128Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8833761Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8834422Z triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8835047Z 
triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8835672Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8835801Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:54.8835895Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.8835983Z Traceback (most recent call last): 2025-12-04T09:58:54.8836137Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.8836177Z self.assertTrue( 2025-12-04T09:58:54.8836286Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.8836336Z raise self.failureException(msg) 2025-12-04T09:58:54.8836463Z AssertionError: False is not true : Log file /tmp/tmpg_se1byr/flex_attention_configs.json was not created 2025-12-04T09:58:54.8836466Z 
2025-12-04T09:58:54.8836542Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.8836707Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.8836711Z 2025-12-04T09:58:54.8836802Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.8836878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8836921Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8836959Z unimplemented [] 2025-12-04T09:58:54.8837023Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8837620Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.8837745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8837781Z graph_break [] 2025-12-04T09:58:54.8837858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8838361Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.8838411Z current_size = base.storage().size() 2025-12-04T09:58:54.8838452Z Autotune Choices Stats: 2025-12-04T09:58:54.8839194Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.8839325Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8839440Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8839600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8840208Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8840822Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8841425Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8842034Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8842670Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8843276Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8843876Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8844472Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8845070Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8845676Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8845809Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.8845849Z Autotune Choices Stats: 2025-12-04T09:58:54.8846655Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:54.8846908Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8847075Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8847354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8847988Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8848609Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8849230Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8849854Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8850489Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8851143Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8851763Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8852411Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8853031Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8853655Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8853786Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.8853860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8853904Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8853942Z unimplemented [] 2025-12-04T09:58:54.8854005Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8854109Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8854692Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8854753Z graph_break [] 2025-12-04T09:58:54.8854828Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.8854868Z Autotune Choices Stats: 2025-12-04T09:58:54.8855615Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.8855744Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8855859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8856050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8856663Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8857262Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8857864Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8858467Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8859082Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8859712Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8860311Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8860919Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8861518Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8862120Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8862250Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.8862290Z Autotune Choices Stats: 2025-12-04T09:58:54.8863061Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.8863278Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8863462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8863749Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8864375Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8865003Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8865626Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8866281Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8866904Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8867556Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8868213Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8868838Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8869459Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8870085Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8870215Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.8870289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8870333Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8870370Z unimplemented [] 2025-12-04T09:58:54.8870430Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8870537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8871114Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8871151Z graph_break [] 2025-12-04T09:58:54.8871225Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8871265Z Autotune Choices Stats: 2025-12-04T09:58:54.8872017Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.8872170Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8872295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8872454Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8873066Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8873665Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8874267Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8874867Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8875469Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.8876123Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8876766Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8877364Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8877962Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8878564Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8878697Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.8878736Z Autotune Choices Stats: 2025-12-04T09:58:54.8879490Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.8879707Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.8879881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8880158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8880825Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8881449Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8882071Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8882694Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8883324Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8883950Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8884586Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8885256Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8885883Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8886537Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8886666Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.8886741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8886783Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8886820Z unimplemented [] 2025-12-04T09:58:54.8886881Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8886979Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8887551Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8887590Z graph_break [] 2025-12-04T09:58:54.8887663Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8887704Z Autotune Choices Stats: 2025-12-04T09:58:54.8888471Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.8888598Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8888739Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8888901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8889530Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8890130Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8890733Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8891335Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8891941Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8892542Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8893150Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8893785Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8894382Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8894982Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8895112Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.8895153Z Autotune Choices Stats: 2025-12-04T09:58:54.8895915Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.8896172Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8896341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8896616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8897262Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8897914Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8898540Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8899161Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8899787Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8900431Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8901059Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8901695Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8902355Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8902980Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8903109Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.8903185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8903226Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8903267Z unimplemented [] 2025-12-04T09:58:54.8903327Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8903427Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8904001Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8904038Z graph_break [] 2025-12-04T09:58:54.8904112Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8904157Z Autotune Choices Stats: 2025-12-04T09:58:54.8904899Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.8905029Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8905144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8905312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8905976Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8906613Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.8907215Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8907818Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8908418Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8909017Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8909643Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8910244Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8910882Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8911482Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8911615Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.8911658Z Autotune Choices Stats: 2025-12-04T09:58:54.8912422Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.8912638Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8912806Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8913083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8913718Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8914354Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8915005Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8915628Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8916316Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8916944Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8917569Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8918209Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8918836Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8919508Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8919634Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.8919712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8919753Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8919794Z unimplemented [] 2025-12-04T09:58:54.8919854Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8919956Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8920534Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8920575Z graph_break [] 2025-12-04T09:58:54.8920649Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8920693Z Autotune Choices Stats: 2025-12-04T09:58:54.8921435Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.8921562Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8921680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8921840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8922467Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8923081Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8923717Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8924325Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8924933Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8925541Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8926181Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8926803Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8927407Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8928041Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8928170Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.8928215Z Autotune Choices Stats: 2025-12-04T09:58:54.8928975Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.8929192Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8929358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8929636Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8930264Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8930900Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8931521Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8932192Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8932818Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8933450Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8934075Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8934703Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8935346Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8936003Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8936172Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.8936269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8936311Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8936352Z unimplemented [] 2025-12-04T09:58:54.8936414Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8936515Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8937089Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.8937129Z graph_break [] 2025-12-04T09:58:54.8937205Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8937245Z Autotune Choices Stats: 2025-12-04T09:58:54.8937986Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.8938113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8938230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8938392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8939014Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8939636Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8940237Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8940875Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8941480Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8942085Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8942688Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8943303Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8943922Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8944525Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8944773Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.8944816Z Autotune Choices Stats: 2025-12-04T09:58:54.8945578Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.8945795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8946002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8946278Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8946909Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8947538Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8948177Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8948797Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8949463Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8950088Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8950711Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8951340Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8951967Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8952609Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8952740Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.8952838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8952882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8952921Z unimplemented [] 2025-12-04T09:58:54.8952987Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8953086Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8953676Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8953715Z graph_break [] 2025-12-04T09:58:54.8953791Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8953831Z Autotune Choices Stats: 2025-12-04T09:58:54.8954576Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.8954705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8954820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8954983Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8955589Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8956228Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8956842Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8957483Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8958085Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8958692Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.8959310Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8959914Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8960518Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8961145Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8961276Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.8961336Z Autotune Choices Stats: 2025-12-04T09:58:54.8962104Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.8962321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8962495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8962777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.8963407Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8964031Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8964657Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8965308Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8965966Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8966642Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8967272Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8967897Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8968523Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8969145Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.8969274Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.8969350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8969393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8969442Z unimplemented [] 2025-12-04T09:58:54.8969505Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8969603Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8970177Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8970233Z graph_break [] 2025-12-04T09:58:54.8970318Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8970359Z Autotune Choices Stats: 2025-12-04T09:58:54.8971098Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:54.8971226Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8971340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8971508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8972117Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8972795Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8973397Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8974017Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8974650Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8975247Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8975853Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8976485Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8977087Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8977683Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8977814Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.8977855Z Autotune Choices Stats: 2025-12-04T09:58:54.8978641Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.8978882Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8979060Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8979338Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8979972Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8980602Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8981227Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8981848Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8982484Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8983151Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8983779Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8984403Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8985031Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8985658Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8985788Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:54.8985865Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.8985909Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.8985989Z unimplemented [] 2025-12-04T09:58:54.8986051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.8986151Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.8986743Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.8986808Z graph_break [] 2025-12-04T09:58:54.8986881Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.8986924Z Autotune Choices Stats: 2025-12-04T09:58:54.8987678Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.8987806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.8987921Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8988084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8988701Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8989308Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8989919Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8990524Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8991130Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.8991761Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8992364Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8992966Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8993566Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8994274Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8994405Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.8994447Z Autotune Choices Stats: 2025-12-04T09:58:54.8995219Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.8995435Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.8995621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.8995919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.8996590Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8997213Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8997829Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8998456Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8999083Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.8999723Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9000384Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9001020Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9001646Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9002267Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9002398Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.9002472Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.9002517Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9002555Z unimplemented [] 2025-12-04T09:58:54.9002618Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9002716Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9003287Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9003325Z graph_break [] 2025-12-04T09:58:54.9003398Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9003438Z Autotune Choices Stats: 2025-12-04T09:58:54.9004194Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.9004340Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9004468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9004631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9005249Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9005851Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9006486Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9007088Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9007702Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9008319Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9008964Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9009572Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9010177Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9010777Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9010906Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.9010947Z Autotune Choices Stats: 2025-12-04T09:58:54.9011705Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.9011923Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9012098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9012378Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9013038Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9013661Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9014287Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9014911Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9015543Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9016205Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9016850Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9017519Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9018149Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9018780Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9018912Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.9018989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9019031Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.9019071Z unimplemented [] 2025-12-04T09:58:54.9019130Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9019229Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9019806Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9019846Z graph_break [] 2025-12-04T09:58:54.9019922Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9019963Z Autotune Choices Stats: 2025-12-04T09:58:54.9020715Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.9020842Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9020980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9021140Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9021767Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9022369Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9022978Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9023580Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9024178Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9024783Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9025404Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9026086Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9026700Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9027307Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9027436Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.9027478Z Autotune Choices Stats: 2025-12-04T09:58:54.9028299Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.9028516Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9028684Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9028961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9029626Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9030296Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9030919Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9031546Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9032177Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9032805Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9033427Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9034071Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9034727Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9035353Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9035482Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.9035558Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9035599Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9035636Z unimplemented [] 
2025-12-04T09:58:54.9035699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9035798Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9036409Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9036451Z graph_break [] 2025-12-04T09:58:54.9036525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9036570Z Autotune Choices Stats: 2025-12-04T09:58:54.9037310Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:54.9037439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9037553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9037731Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9038344Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9038981Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9039580Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9040188Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9040792Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9041404Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9042011Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9042627Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9043269Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9043869Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9045704Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:54.9045750Z Autotune Choices Stats: 2025-12-04T09:58:54.9046546Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.9046767Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9046945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9047226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9047862Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9048515Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9049179Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9049802Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9050435Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9051141Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9051777Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9052413Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9053060Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9053707Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9053839Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:54.9053918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9053960Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9053997Z unimplemented [] 2025-12-04T09:58:54.9054059Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.9054159Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9054760Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9054799Z graph_break [] 2025-12-04T09:58:54.9054876Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9054918Z Autotune Choices Stats: 2025-12-04T09:58:54.9055662Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.9055792Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9055907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9056102Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9056735Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9057349Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9057982Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9058581Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9059207Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9059817Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9060434Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9061050Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9061657Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9062288Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9062420Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.9062464Z Autotune Choices Stats: 
2025-12-04T09:58:54.9063223Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.9063467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9063638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9063921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9064557Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9065187Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9065834Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9066531Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9067164Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9067809Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9068437Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9069074Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9069718Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9070349Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9070491Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.9070580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9070623Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9070662Z unimplemented [] 2025-12-04T09:58:54.9070725Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9070828Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9071409Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9071460Z graph_break [] 2025-12-04T09:58:54.9071534Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9071575Z Autotune Choices Stats: 2025-12-04T09:58:54.9072321Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.9072449Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9072567Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9072732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.9073347Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9073973Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9074580Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9075206Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9075817Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9076478Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9077084Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9077693Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9078324Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9078931Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9079082Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.9079146Z Autotune Choices Stats: 2025-12-04T09:58:54.9079909Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9080131Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9080331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9080611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9081251Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9081887Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9082535Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9083158Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9083810Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9084441Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9085145Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9085770Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9086455Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9087101Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9087230Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.9087326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9087368Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9087407Z unimplemented [] 2025-12-04T09:58:54.9087468Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9087570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9088152Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9088191Z graph_break [] 2025-12-04T09:58:54.9088267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9088307Z Autotune Choices Stats: 2025-12-04T09:58:54.9089056Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.9089209Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9089323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9089485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9090098Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9090702Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9091323Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9091929Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9092545Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9093155Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9093764Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9094365Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9094961Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9095576Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9095704Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.9095757Z Autotune Choices Stats: 2025-12-04T09:58:54.9096576Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.9096792Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9096960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9097236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9097886Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9098513Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9099136Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9099764Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9100394Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9101044Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9101667Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9102308Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9102953Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9103581Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9103711Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.9103786Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9103831Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9103867Z unimplemented [] 2025-12-04T09:58:54.9103938Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9104039Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9104612Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9104665Z graph_break [] 2025-12-04T09:58:54.9104748Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9104788Z Autotune Choices Stats: 2025-12-04T09:58:54.9105527Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.9105653Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9105780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9105980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9106588Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9107190Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9107789Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9108411Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9109048Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9109650Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9110252Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9110862Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9111534Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9112137Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9112267Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:54.9112309Z Autotune Choices Stats: 2025-12-04T09:58:54.9113082Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.9113316Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9113496Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9113774Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9114398Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9115035Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9115661Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9116358Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9117001Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9117629Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9118285Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9118909Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9119555Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9120187Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9120365Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.9120441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9120485Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9120521Z unimplemented [] 2025-12-04T09:58:54.9120583Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9120683Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9121261Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9121298Z graph_break [] 2025-12-04T09:58:54.9121382Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9121423Z Autotune Choices Stats: 2025-12-04T09:58:54.9122176Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.9122302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9122416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9122578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9123189Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9123805Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9124410Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9125020Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9125624Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9126288Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9126899Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9127503Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9128119Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9128722Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9128852Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.9128894Z Autotune Choices Stats: 2025-12-04T09:58:54.9129679Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9129896Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9130074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9130351Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9130993Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9131621Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9132255Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9132876Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9133503Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9134142Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9134782Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9135411Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9136076Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9136727Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9136854Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.9136931Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9136976Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9137012Z unimplemented [] 2025-12-04T09:58:54.9137074Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9137172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9137748Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9137790Z graph_break [] 2025-12-04T09:58:54.9137863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9137905Z Autotune Choices Stats: 2025-12-04T09:58:54.9138665Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.9138806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9138933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9139095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9139706Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9140312Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9140926Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9141528Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:54.9142136Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9142751Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9143372Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9143978Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9144581Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9145206Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9145335Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.9145375Z Autotune Choices Stats: 2025-12-04T09:58:54.9146170Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9146389Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9146573Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9146850Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9147508Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9148131Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9148756Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9149393Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9150018Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9150643Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9151280Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9151931Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9152559Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9153181Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9153322Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.9153398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9153441Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9153478Z unimplemented [] 2025-12-04T09:58:54.9153539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9153638Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9154213Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9154254Z graph_break [] 2025-12-04T09:58:54.9154327Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9154368Z Autotune Choices Stats: 2025-12-04T09:58:54.9155114Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.9155245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9155360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9155529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9156200Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9156795Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9157397Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9158018Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9158626Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9159232Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9159847Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9160487Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9161086Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9161687Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9161830Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.9161871Z Autotune Choices Stats: 2025-12-04T09:58:54.9162631Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.9162848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9163014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9163290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9163934Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9164581Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9165207Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9165828Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9166522Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9167157Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9167779Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9168418Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9169068Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9169693Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9169822Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.9169912Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9169954Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9169991Z unimplemented [] 2025-12-04T09:58:54.9170052Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9170155Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9170727Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9170765Z graph_break [] 2025-12-04T09:58:54.9170840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9170882Z Autotune Choices Stats: 2025-12-04T09:58:54.9171623Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.9171751Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9171868Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9172036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9172644Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9173260Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9173869Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9174482Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9175084Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9175684Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9176345Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9176966Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9177598Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9178202Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9178331Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.9178392Z Autotune Choices Stats: 2025-12-04T09:58:54.9179152Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.9179369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9179536Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9179812Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9180449Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9181083Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9181729Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9182355Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9182986Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9183620Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9184248Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9184875Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9185517Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9186211Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9186340Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.9186415Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9186457Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9186495Z unimplemented [] 2025-12-04T09:58:54.9186556Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9186655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9187240Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9187295Z graph_break [] 2025-12-04T09:58:54.9187371Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9187412Z Autotune Choices Stats: 2025-12-04T09:58:54.9188156Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.9188283Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9188398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9188561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9189192Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9189797Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9190423Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9191026Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9191648Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9192247Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9192848Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9193472Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9194077Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9194703Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9194833Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.9194877Z Autotune Choices Stats: 2025-12-04T09:58:54.9195635Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.9195867Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9196079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9196357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9196995Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9197622Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9198271Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9198914Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9199541Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9200183Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9200810Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9201438Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9202077Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9202707Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9202846Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.9202932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9202975Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9203012Z unimplemented [] 2025-12-04T09:58:54.9203075Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9203173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9203749Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9203787Z graph_break [] 
2025-12-04T09:58:54.9203874Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9203915Z Autotune Choices Stats: 2025-12-04T09:58:54.9204662Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.9204790Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9204905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9205067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9205674Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9206344Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9206953Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9207581Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9208183Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9208807Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9209419Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9210027Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9210638Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9211245Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9211382Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.9211423Z Autotune Choices Stats: 2025-12-04T09:58:54.9212190Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.9212405Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9212585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9212863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9213494Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9214122Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.9214758Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9215383Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9216079Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9216707Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9217344Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9217973Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9218600Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9219244Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9219374Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.9219462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9219505Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9219541Z unimplemented [] 2025-12-04T09:58:54.9219602Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9219703Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9220292Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9220329Z graph_break [] 2025-12-04T09:58:54.9220406Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:54.9220446Z Autotune Choices Stats: 2025-12-04T09:58:54.9221191Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.9221331Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9221444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9221606Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9222215Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9222823Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9223446Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9224045Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9224678Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9225284Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9225896Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9226544Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9227147Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9227768Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9227898Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.9227953Z Autotune Choices Stats: 2025-12-04T09:58:54.9228722Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.9228940Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9229108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9229386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9230032Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9230657Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9231276Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9231918Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9232543Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9233192Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9233821Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9234459Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9235091Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9235715Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9235846Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.9235961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9236005Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9236042Z unimplemented [] 2025-12-04T09:58:54.9236119Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9236219Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9236793Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9236846Z graph_break [] 2025-12-04T09:58:54.9236934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9236976Z Autotune Choices Stats: 
2025-12-04T09:58:54.9237714Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.9237841Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9237970Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9238133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9238752Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9239359Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9239964Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9240585Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9241211Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9241812Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9242416Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9243038Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9243645Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9244248Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9244379Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.9244420Z Autotune Choices Stats: 2025-12-04T09:58:54.9245198Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.9245424Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9245599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9245873Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9246575Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9247217Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9247844Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9248470Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9249110Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9249747Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9250377Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9251002Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9251644Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9252272Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9252402Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.9252478Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9252520Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9252559Z unimplemented [] 2025-12-04T09:58:54.9252619Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9252719Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9253304Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9253343Z graph_break [] 2025-12-04T09:58:54.9253427Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9253470Z Autotune Choices Stats: 2025-12-04T09:58:54.9254216Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.9254346Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9254462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9254622Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9255233Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9255846Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9256496Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9257102Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9257729Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9258356Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9258960Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9259567Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9260176Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9260787Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9260917Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.9260963Z Autotune Choices Stats: 2025-12-04T09:58:54.9261735Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.9261953Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.9262130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9262415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9263048Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9263676Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9264312Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9264938Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9265566Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9266248Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9266900Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9267532Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9268161Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9268798Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9268926Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.9269002Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9269043Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9269082Z unimplemented [] 2025-12-04T09:58:54.9269142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9269242Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9269817Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9269857Z graph_break [] 2025-12-04T09:58:54.9269931Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9269973Z Autotune Choices Stats: 2025-12-04T09:58:54.9270724Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.9270859Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9270987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9271148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9271760Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9272361Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9272981Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9273585Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9274194Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9274808Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9275427Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9276073Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9276676Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9277288Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9277417Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.9277459Z Autotune Choices Stats: 2025-12-04T09:58:54.9278222Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.9278440Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9278619Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9278895Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9279550Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9280175Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9280802Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9281435Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9282065Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9282695Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9283335Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9283986Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9284614Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9285243Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9285381Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.9285456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9285497Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9285537Z unimplemented [] 2025-12-04T09:58:54.9285597Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9285696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9286311Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9286349Z graph_break [] 2025-12-04T09:58:54.9286422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9286462Z Autotune Choices Stats: 2025-12-04T09:58:54.9287223Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.9287349Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9287475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9287634Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9288262Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9288866Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9289484Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9290088Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9290692Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9291298Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9291920Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9292544Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9293149Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9293755Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9296316Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.9296364Z Autotune Choices Stats: 2025-12-04T09:58:54.9297131Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9297351Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9297519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9297799Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9298472Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9299128Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9299751Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9300377Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9301021Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9301648Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9302271Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9302913Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9303562Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9304185Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9304328Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.9304407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9304452Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9304490Z unimplemented [] 2025-12-04T09:58:54.9304553Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9304654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9305231Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9305268Z graph_break [] 2025-12-04T09:58:54.9305346Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9305386Z Autotune Choices Stats: 2025-12-04T09:58:54.9306166Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.9306295Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9306412Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9306592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9307199Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9307837Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:54.9308441Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9309061Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9309667Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9310273Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9310881Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9311477Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9312101Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9312703Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9312845Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.9312885Z Autotune Choices Stats: 2025-12-04T09:58:54.9313638Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9313856Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9314027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9314303Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9314935Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9315569Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9316278Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9316903Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9317544Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9318169Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9318801Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9319447Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9320071Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9320717Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9320847Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.9320921Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9320965Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9321003Z unimplemented [] 2025-12-04T09:58:54.9321065Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9321165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9321752Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9321789Z graph_break [] 2025-12-04T09:58:54.9321864Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9321904Z Autotune Choices Stats: 2025-12-04T09:58:54.9322647Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.9322775Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9322889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9323050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9323679Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9324277Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9324901Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9325507Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9326184Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9326790Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9327390Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9328018Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9328621Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9329250Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9329381Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.9329422Z Autotune Choices Stats: 2025-12-04T09:58:54.9330179Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.9330411Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9330575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9330851Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9331482Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9332118Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9332746Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9333393Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9334022Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9334666Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9335288Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9335910Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9336600Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9337229Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9337386Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.9337460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9337505Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9337542Z unimplemented [] 2025-12-04T09:58:54.9337603Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9337701Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9338274Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9338340Z graph_break [] 2025-12-04T09:58:54.9338414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9338456Z Autotune Choices Stats: 2025-12-04T09:58:54.9339203Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.9339331Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9339445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9339607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9340219Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9340832Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9341459Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9342061Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9342662Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9343272Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9343882Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9344487Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9345102Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9345701Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9345852Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.9345892Z Autotune Choices Stats: 2025-12-04T09:58:54.9346693Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9346914Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9347096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9347373Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9348012Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9348635Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9349276Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9349900Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9350565Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9351189Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9351824Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9352457Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9353084Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9353722Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9353861Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.9353935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9353978Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9354015Z unimplemented [] 2025-12-04T09:58:54.9354078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9354178Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9354783Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9354822Z graph_break [] 2025-12-04T09:58:54.9354895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9354935Z Autotune Choices Stats: 2025-12-04T09:58:54.9355676Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.9355816Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9355981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9356140Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9356762Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9357371Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9357995Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9358621Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9359226Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9359829Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9360446Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9361048Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9361648Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9362260Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9362399Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.9362440Z Autotune Choices Stats: 2025-12-04T09:58:54.9363207Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.9363424Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9363591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9363870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9364509Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9365133Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9365763Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9366456Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9367116Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9367743Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9368371Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9369020Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9369649Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9370279Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9370408Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.9370494Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9370537Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9370576Z unimplemented [] 2025-12-04T09:58:54.9370636Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9370736Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9371326Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9371375Z graph_break [] 2025-12-04T09:58:54.9371449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9371491Z Autotune Choices Stats: 2025-12-04T09:58:54.9372235Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:54.9372374Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9372491Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9372652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9373263Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9373863Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9374466Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9375083Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9375702Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9376342Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9376951Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9377568Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9378171Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9378772Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9378901Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:54.9378943Z Autotune Choices Stats: 2025-12-04T09:58:54.9379719Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9379966Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9380132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9380413Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:54.9381048Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9381690Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9382316Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9382949Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9383593Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9384242Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9384866Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9385498Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9386178Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9386811Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9386939Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:54.9387014Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9387057Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9387094Z unimplemented [] 2025-12-04T09:58:54.9387156Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9387257Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9387852Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9387902Z graph_break [] 2025-12-04T09:58:54.9387980Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9388020Z Autotune Choices Stats: 2025-12-04T09:58:54.9388776Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:54.9388902Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9389017Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9389178Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9389803Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9390407Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9391006Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9391606Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9392217Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9392841Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9393447Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9394050Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9394671Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9395279Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9395407Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:54.9395448Z Autotune Choices Stats: 2025-12-04T09:58:54.9396271Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9396486Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9396666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9396957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9397584Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9398208Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9398845Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9399474Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9400103Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9400744Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9401397Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9402027Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9402653Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9403292Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9403422Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:54.9403498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9403540Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9403578Z unimplemented [] 2025-12-04T09:58:54.9403639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9403739Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9404309Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9404346Z graph_break [] 2025-12-04T09:58:54.9404430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9404470Z Autotune Choices Stats: 2025-12-04T09:58:54.9405203Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:54.9405351Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9405466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9405625Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9406279Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9406895Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9407498Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9408104Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9408736Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9409337Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9409972Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9410576Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9411189Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9411788Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9411918Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:54.9411958Z Autotune Choices Stats: 2025-12-04T09:58:54.9412720Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.9412937Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9413112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9413389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9414042Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9414671Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9415296Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9415970Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9416601Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9417232Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9417874Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9418537Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9419162Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9419801Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9419930Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:54.9420004Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9420047Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9420084Z unimplemented [] 2025-12-04T09:58:54.9420144Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9420244Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9420812Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9420849Z graph_break [] 2025-12-04T09:58:54.9420923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9420962Z Autotune Choices Stats: 2025-12-04T09:58:54.9421711Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:54.9421849Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9421961Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9422125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9422743Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9423349Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9423965Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9424568Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9425168Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9425785Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9426434Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9427067Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9427671Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9428290Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9428420Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:54.9428459Z Autotune Choices Stats: 2025-12-04T09:58:54.9429213Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.9429431Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9429595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9429877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9430528Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9431173Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9431796Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9432434Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9433061Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9433691Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9434326Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9434959Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9435609Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9436273Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9436423Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:54.9436497Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.9436540Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9436577Z unimplemented [] 2025-12-04T09:58:54.9436639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9436739Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9437318Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9437357Z graph_break [] 2025-12-04T09:58:54.9437431Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9437472Z Autotune Choices Stats: 2025-12-04T09:58:54.9438209Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:54.9438339Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9438467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9438630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9439258Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9439869Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9440471Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9441080Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9441688Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9442289Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9442903Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9443509Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9444132Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9444734Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9444874Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:54.9444915Z Autotune Choices Stats: 2025-12-04T09:58:54.9445678Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.9445901Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9446105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9446385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9447030Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9447654Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9448309Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9448938Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9449586Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9450214Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9450837Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9451487Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9452115Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9452766Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9452894Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:54.9452987Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:54.9453036Z Traceback (most 
recent call last): 2025-12-04T09:58:54.9453189Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:54.9453240Z self.assertTrue( 2025-12-04T09:58:54.9453345Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:54.9453398Z raise self.failureException(msg) 2025-12-04T09:58:54.9453524Z AssertionError: False is not true : Log file /tmp/tmpit3gbah7/flex_attention_configs.json was not created 2025-12-04T09:58:54.9453529Z 2025-12-04T09:58:54.9453604Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:54.9453768Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:54.9453770Z 2025-12-04T09:58:54.9453861Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:54.9453938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9453980Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9454019Z unimplemented [] 2025-12-04T09:58:54.9454079Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9454659Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:54.9454759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9454798Z graph_break [] 2025-12-04T09:58:54.9454872Z ----------------------------- Captured 
stderr call ----------------------------- 2025-12-04T09:58:54.9455375Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:54.9455428Z current_size = base.storage().size() 2025-12-04T09:58:54.9455468Z Autotune Choices Stats: 2025-12-04T09:58:54.9456249Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:54.9456403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9456519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9456678Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9457296Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9457913Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9458515Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9459116Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9459737Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9460426Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9461049Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9461652Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9462267Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9462867Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9463000Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:54.9463042Z Autotune Choices Stats: 2025-12-04T09:58:54.9463801Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.9464027Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9464204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9464490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9465131Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9465753Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9466442Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9467064Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9467695Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9468338Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9468961Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9469610Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9470234Z 
triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9470875Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9471006Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:54.9471081Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9471124Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9471165Z unimplemented [] 2025-12-04T09:58:54.9471227Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9471327Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9471899Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9471936Z graph_break [] 2025-12-04T09:58:54.9472010Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9472049Z Autotune Choices Stats: 2025-12-04T09:58:54.9472796Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:54.9472934Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9473049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9473212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9473833Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9474434Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9475049Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9475652Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9476289Z triton_flex_attention_68 
0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9476903Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9477508Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9478134Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9478733Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9479347Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9479476Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:54.9479517Z Autotune Choices Stats: 2025-12-04T09:58:54.9480276Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.9480494Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9480659Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9480945Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9481575Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9482224Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9482847Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9483482Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9484115Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9484742Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9485377Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9486047Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9486706Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9487332Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9487476Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:54.9487551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9487595Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9487632Z unimplemented [] 2025-12-04T09:58:54.9487692Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9487793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9488367Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9488404Z graph_break [] 2025-12-04T09:58:54.9488481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9488523Z Autotune Choices Stats: 2025-12-04T09:58:54.9489256Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:54.9489385Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9489510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9489672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9490290Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9490906Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9491507Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9492118Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9492726Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9493336Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9493953Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9494551Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9495177Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9495779Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9495917Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:54.9495994Z Autotune Choices Stats: 2025-12-04T09:58:54.9496752Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:54.9496970Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9497141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9497422Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9498064Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9498689Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9499343Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9499975Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9500619Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9501245Z triton_flex_attention_backward_135 
0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9501873Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9502511Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9503137Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9503778Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9503911Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:54.9503984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9504028Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9504065Z unimplemented [] 2025-12-04T09:58:54.9504127Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9504238Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9504807Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 
2025-12-04T09:58:54.9504843Z graph_break [] 2025-12-04T09:58:54.9504917Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9504957Z Autotune Choices Stats: 2025-12-04T09:58:54.9505698Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:54.9505826Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9505980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9506145Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9506772Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9507401Z 
triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9508003Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9508610Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9509221Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9509825Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9510424Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9511040Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9511639Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9512266Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9512396Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:54.9512436Z Autotune Choices Stats: 2025-12-04T09:58:54.9513199Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", 
"best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.9513425Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9513590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9513868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9514496Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9515141Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9515765Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9516446Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9517073Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9517716Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9518341Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9518967Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9519608Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9520235Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9520390Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:54.9520467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9520509Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9520546Z unimplemented [] 2025-12-04T09:58:54.9520609Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9520708Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9521285Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9521335Z graph_break [] 2025-12-04T09:58:54.9521410Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:54.9521451Z Autotune Choices Stats: 2025-12-04T09:58:54.9522194Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.9522324Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9522439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9522600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9523213Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9523829Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9524453Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9525056Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9525658Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9526311Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9526917Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9527523Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9528137Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9528764Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9528895Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:54.9528935Z Autotune Choices Stats: 2025-12-04T09:58:54.9529696Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.9529933Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9530100Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9530377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9531014Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9531643Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9532278Z triton_flex_attention_backward_216 0.0186 ms 
84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9532905Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9533552Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9534175Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9534817Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9535443Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9536087Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9536725Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9536867Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:54.9536943Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9536984Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9537021Z unimplemented [] 2025-12-04T09:58:54.9537082Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9537195Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9537773Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9537812Z graph_break [] 2025-12-04T09:58:54.9537885Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9537926Z Autotune Choices Stats: 
2025-12-04T09:58:54.9538662Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.9538802Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9538918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9539077Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9539689Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9540287Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9540902Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9541529Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9542135Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9542738Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9543346Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9543947Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9544549Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9545161Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9545300Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:54.9545342Z Autotune Choices Stats: 2025-12-04T09:58:54.9546151Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.9546369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9546534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9546826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9547460Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9548089Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9548716Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9549358Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9550002Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9550630Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9551256Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9551893Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9552519Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9553150Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9553280Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:54.9553367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9553411Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9553450Z unimplemented [] 2025-12-04T09:58:54.9553511Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9553615Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9554210Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9554250Z graph_break [] 2025-12-04T09:58:54.9554323Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9554364Z Autotune Choices Stats: 2025-12-04T09:58:54.9555105Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:54.9555243Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9555360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9555520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9556162Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9556770Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9557376Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9558014Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9558647Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9559253Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9559859Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9560472Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9561074Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9561677Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9561808Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:54.9561859Z Autotune Choices Stats: 2025-12-04T09:58:54.9562611Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.9562859Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:54.9563029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9563307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9563949Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9564584Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9565208Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9565834Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9566507Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9567158Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9567780Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9568408Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9569046Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9569671Z triton_flex_attention_backward_303 0.0230 ms 67.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9569799Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:54.9569876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9569918Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9569957Z unimplemented [] 2025-12-04T09:58:54.9570018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9570121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9570708Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9570757Z graph_break [] 2025-12-04T09:58:54.9570832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9570873Z Autotune Choices Stats: 2025-12-04T09:58:54.9571623Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:54.9571749Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9571864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9572026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9572650Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9573252Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9573857Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9574460Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9575075Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9575696Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9576338Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9576945Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9577580Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9578183Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9578311Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:54.9578356Z Autotune Choices Stats: 2025-12-04T09:58:54.9579131Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9579351Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9579537Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9579825Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9580460Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9581089Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9581727Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9582349Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9582974Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9583609Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9584259Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9584882Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9585503Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9586181Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9586312Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:54.9586389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9586432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9586469Z unimplemented [] 2025-12-04T09:58:54.9586530Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9586630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9587205Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9587242Z graph_break [] 2025-12-04T09:58:54.9587315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9587377Z Autotune Choices Stats: 2025-12-04T09:58:54.9588118Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.9588273Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9588393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9588557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9589161Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9589782Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.9590386Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9590986Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9591600Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9592202Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9592828Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9593432Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9594053Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9594652Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9594782Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:54.9594824Z Autotune Choices Stats: 2025-12-04T09:58:54.9595588Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.9595805Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9596030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9596309Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9596965Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9597596Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9598219Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9598859Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9599484Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9600115Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9600755Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9601399Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9602029Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9602671Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9602802Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:54.9602876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9602920Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9602957Z unimplemented [] 2025-12-04T09:58:54.9603018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9603118Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9603694Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9603731Z graph_break [] 2025-12-04T09:58:54.9603807Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9603846Z Autotune Choices Stats: 2025-12-04T09:58:54.9604604Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:54.9604747Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9604861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9605021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9605637Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9606281Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9606898Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9607497Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9608094Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9608714Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9609312Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9609941Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9610539Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9611157Z triton_flex_attention_432 0.0162 ms 57.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9611287Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:54.9611327Z Autotune Choices Stats: 2025-12-04T09:58:54.9612082Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:54.9612297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9612463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9612744Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9613384Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9614026Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9614648Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9615287Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9616031Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9616680Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9617327Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9617958Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9618606Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9619233Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9619381Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:54.9619457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9619501Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9619539Z unimplemented [] 2025-12-04T09:58:54.9619605Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9619705Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9620286Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9620327Z graph_break [] 2025-12-04T09:58:54.9620402Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9620444Z Autotune Choices Stats: 2025-12-04T09:58:54.9621185Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:54.9621316Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9621443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9621606Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9622221Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9622846Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9623445Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9624059Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9624662Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9625260Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9625877Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9626517Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9627150Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9627748Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9627894Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:54.9627934Z Autotune Choices Stats: 2025-12-04T09:58:54.9628699Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:54.9628917Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9629083Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9629358Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9630004Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9630625Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9631269Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9631892Z triton_flex_attention_backward_493 0.0190 ms 81.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9632531Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9633159Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9633782Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9634415Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9635041Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9635687Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9635816Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:54.9635892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9635968Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9636008Z unimplemented [] 2025-12-04T09:58:54.9636068Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9636184Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9636758Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9636796Z graph_break [] 2025-12-04T09:58:54.9636868Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9636910Z Autotune Choices Stats: 2025-12-04T09:58:54.9637653Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:54.9637783Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9637900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9638060Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9638685Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9639298Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9639911Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9640513Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9641123Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9641724Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9642326Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9642945Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9643544Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9644168Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9644296Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:54.9644336Z Autotune Choices Stats: 2025-12-04T09:58:54.9645095Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:54.9645321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9645487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9645765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:54.9646434Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9647084Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9647699Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9648354Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9648985Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9649623Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9650242Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9650871Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9651508Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9652129Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9652277Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:54.9652354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9652397Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9652438Z unimplemented [] 2025-12-04T09:58:54.9652498Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9652600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9653175Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9653227Z graph_break [] 2025-12-04T09:58:54.9653301Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9653344Z Autotune Choices Stats: 2025-12-04T09:58:54.9654082Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 
2025-12-04T09:58:54.9654212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9654327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9654488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9655094Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9655712Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9656364Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9656964Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9657565Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9658191Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9658789Z triton_flex_attention_566 0.0140 ms 72.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9659394Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9660006Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9660619Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9660758Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:54.9660800Z Autotune Choices Stats: 2025-12-04T09:58:54.9661558Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.9661777Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9661954Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9662233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9662868Z triton_flex_attention_backward_593 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9663496Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9664128Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9664755Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9665402Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9666070Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9666705Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9667335Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9667954Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9668590Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9668729Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 
choices 2025-12-04T09:58:54.9668805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9668846Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9668885Z unimplemented [] 2025-12-04T09:58:54.9668945Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9669046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9669634Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9669676Z graph_break [] 2025-12-04T09:58:54.9669754Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9669797Z Autotune Choices Stats: 2025-12-04T09:58:54.9670540Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:54.9670677Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:54.9670795Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9670957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9671573Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9672172Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9672783Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9673406Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9674016Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9674618Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9675229Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9675835Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9676472Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9677092Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9677233Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:54.9677275Z Autotune Choices Stats: 2025-12-04T09:58:54.9678053Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:54.9678272Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9678438Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9678714Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9679359Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9679985Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9680608Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9681234Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9681882Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9682513Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9683139Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9683775Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9684400Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9685023Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9685151Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:54.9685228Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:54.9685287Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9685326Z unimplemented [] 2025-12-04T09:58:54.9685386Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9685490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9686112Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9686173Z graph_break [] 2025-12-04T09:58:54.9686249Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9686289Z Autotune Choices Stats: 2025-12-04T09:58:54.9687028Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:54.9687176Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9687296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9687457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9688062Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9688667Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9689272Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9689897Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9690525Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9691126Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9691730Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9692339Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9692938Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9693544Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9693675Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:54.9693718Z Autotune Choices Stats: 2025-12-04T09:58:54.9694478Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9694717Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9694885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9695164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9695797Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9696476Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9697100Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9697720Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9698374Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9699024Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9699645Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9700273Z triton_flex_attention_backward_689 0.0220 ms 70.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9700911Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9701529Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9701661Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:54.9701735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9701779Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:54.9701818Z unimplemented [] 2025-12-04T09:58:54.9701883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9701983Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9702580Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9702628Z graph_break [] 2025-12-04T09:58:54.9702704Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9702744Z Autotune Choices Stats: 2025-12-04T09:58:54.9703491Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.9703621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9703737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9703899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9704511Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9705126Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9705728Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9706363Z triton_flex_attention_695 0.0117 ms 86.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9706996Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9707618Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9708220Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9708824Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9709440Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9710043Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9710174Z SingleProcess AUTOTUNE 
benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:54.9710215Z Autotune Choices Stats: 2025-12-04T09:58:54.9710978Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:54.9711196Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9711383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9711672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9712302Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9712930Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9713563Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9714191Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9714817Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9715454Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9716217Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9716849Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9717474Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9718113Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9718245Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:54.9718319Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9718365Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9718402Z unimplemented [] 
2025-12-04T09:58:54.9718464Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9718564Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9719139Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9719178Z graph_break [] 2025-12-04T09:58:54.9719253Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9719307Z Autotune Choices Stats: 2025-12-04T09:58:54.9720042Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:54.9720185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9720312Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9720474Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9721088Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9721711Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9722314Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9722915Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9723520Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9724136Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9724758Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9725363Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9725998Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9726609Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9726739Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling 
for 24 choices 2025-12-04T09:58:54.9726779Z Autotune Choices Stats: 2025-12-04T09:58:54.9727534Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:54.9727756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9727932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9728208Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9728866Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9729489Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9730118Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9730751Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9731376Z 
triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9732004Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9732638Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9733281Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9733908Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9734533Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9734672Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:54.9734745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9734787Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9734971Z unimplemented [] 2025-12-04T09:58:54.9735036Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:54.9735137Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9735712Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9735752Z graph_break [] 2025-12-04T09:58:54.9735825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9735867Z Autotune Choices Stats: 2025-12-04T09:58:54.9736653Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:54.9736781Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9736909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9737071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9737703Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9738300Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9738918Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9739519Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9740120Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9740730Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9741339Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9741965Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9742566Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9743178Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9743312Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:54.9743352Z Autotune Choices Stats: 
2025-12-04T09:58:54.9744118Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9744339Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9744505Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9744782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9745427Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9746117Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9746740Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9747361Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9748007Z triton_flex_attention_backward_825 0.0202 ms 77.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9748635Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9749257Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9749901Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9750544Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9751169Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9751310Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:54.9751384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9751427Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9751467Z unimplemented [] 2025-12-04T09:58:54.9751530Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9751632Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9752214Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9752255Z graph_break [] 2025-12-04T09:58:54.9752329Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9752371Z Autotune Choices Stats: 2025-12-04T09:58:54.9753109Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.9753238Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9753354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9753525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:54.9754135Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9754766Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9755366Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9756023Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9756631Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9757229Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9757852Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.9758453Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9759087Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9759686Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9759828Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:54.9759868Z Autotune Choices Stats: 2025-12-04T09:58:54.9760621Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9760839Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9761010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9761288Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9761925Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:54.9762558Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9763203Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9763826Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9764464Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9765091Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9765715Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9766399Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9767024Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9767673Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9767802Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:54.9767877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9767920Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9767962Z unimplemented [] 2025-12-04T09:58:54.9768023Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9768125Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9768710Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9768748Z graph_break [] 2025-12-04T09:58:54.9768822Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9768866Z Autotune Choices Stats: 2025-12-04T09:58:54.9769606Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:54.9769734Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9769852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9770013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9770639Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9771238Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9771857Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9772459Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9773071Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9773673Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9774275Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9774889Z triton_flex_attention_894 0.0141 ms 68.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9775488Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9776140Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9776269Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:54.9776311Z Autotune Choices Stats: 2025-12-04T09:58:54.9777076Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:54.9777310Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9777479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9777757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9778386Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9779032Z 
triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9779655Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9780299Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9780927Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9781565Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9782184Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9782809Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9783455Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9784080Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9784229Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:54.9784306Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9784349Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9784388Z unimplemented [] 2025-12-04T09:58:54.9784450Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9784550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9785122Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9785178Z graph_break [] 2025-12-04T09:58:54.9785258Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9785298Z Autotune Choices Stats: 2025-12-04T09:58:54.9786087Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:54.9786214Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9786330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9786491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9787098Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9787725Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9788340Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9788951Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9789554Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9790173Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9790775Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9791375Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9791992Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9792591Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9792741Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:54.9792786Z Autotune Choices Stats: 2025-12-04T09:58:54.9793544Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:54.9793761Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9793937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9794218Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9794850Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9795476Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9796160Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9796785Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9797435Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9798060Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9798699Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9801885Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9802507Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9803161Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9803293Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:54.9803386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9803430Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9803470Z unimplemented [] 2025-12-04T09:58:54.9803532Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9803637Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9804233Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9804273Z graph_break [] 2025-12-04T09:58:54.9804351Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9804391Z Autotune Choices Stats: 2025-12-04T09:58:54.9805145Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:54.9805286Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9805405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9805568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9806218Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9806816Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9807432Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9808056Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9808659Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9809260Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9809879Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9810482Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9811083Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9811697Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9811836Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:54.9811879Z Autotune Choices Stats: 2025-12-04T09:58:54.9812658Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:54.9812876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9813043Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9813321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9813961Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9814587Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9815212Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9815845Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9816510Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9817164Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9817789Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9818427Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9819055Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9819680Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9819811Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:54.9819888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9819943Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9819982Z unimplemented [] 2025-12-04T09:58:54.9820044Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9820145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9820736Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9820774Z graph_break [] 2025-12-04T09:58:54.9820861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9820901Z Autotune Choices Stats: 2025-12-04T09:58:54.9821653Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:54.9821783Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9821908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9822069Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9822678Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9823284Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9823890Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9824507Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:54.9825123Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9825728Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9826358Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9826968Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9827570Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9828177Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9828308Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:54.9828348Z Autotune Choices Stats: 2025-12-04T09:58:54.9829122Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:54.9829362Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9829530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9829809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9830444Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9831079Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9831706Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9832334Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9832975Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9833624Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9834248Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9834878Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9835509Z 
triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9836175Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9836308Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:54.9836382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9836425Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9836461Z unimplemented [] 2025-12-04T09:58:54.9836523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9836622Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9837216Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9837263Z graph_break [] 2025-12-04T09:58:54.9837339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9837379Z Autotune Choices Stats: 2025-12-04T09:58:54.9838128Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:54.9838256Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9838372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9838533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9839140Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9839751Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9840353Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9840957Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9841576Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9842196Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9842792Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9843394Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9844003Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9844610Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9844740Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:54.9844780Z Autotune Choices Stats: 2025-12-04T09:58:54.9845545Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.9845764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9845988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9846281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9846914Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9847542Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9848180Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9848804Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9849429Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9850065Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9850710Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9851340Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9851963Z 
triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9852609Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9852738Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:54.9852812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9852857Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9852894Z unimplemented [] 2025-12-04T09:58:54.9852954Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9853052Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9853627Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9853666Z graph_break [] 2025-12-04T09:58:54.9853739Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9853791Z Autotune Choices Stats: 2025-12-04T09:58:54.9854527Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:54.9854692Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9854807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9854968Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9855580Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9856233Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9856839Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9857441Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9858050Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9858668Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9859298Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9859900Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9860502Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9861116Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9861247Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:54.9861287Z Autotune Choices Stats: 2025-12-04T09:58:54.9862052Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:54.9862270Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9862447Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9862723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9863371Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9863998Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9864622Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9865256Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9865888Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9866554Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9867194Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9867857Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9868486Z 
triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9869125Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9869255Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:54.9869329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9869371Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9869409Z unimplemented [] 2025-12-04T09:58:54.9869468Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9869570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9870147Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9870186Z graph_break [] 2025-12-04T09:58:54.9870260Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9870304Z Autotune Choices Stats: 2025-12-04T09:58:54.9871050Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:54.9871180Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9871305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9871465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9872088Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9872698Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9873310Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9873916Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9874518Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9875129Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9875730Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9876394Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9876997Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9877612Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9877742Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:54.9877784Z Autotune Choices Stats: 2025-12-04T09:58:54.9878545Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.9878762Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9878929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9879207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9879851Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9880500Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9881122Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9881760Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9882389Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9883018Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9883653Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9884284Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9884935Z 
triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9885557Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9885696Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:54.9885771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9885813Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9885850Z unimplemented [] 2025-12-04T09:58:54.9885910Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9886046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9886627Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9886665Z graph_break [] 2025-12-04T09:58:54.9886738Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9886781Z Autotune Choices Stats: 2025-12-04T09:58:54.9887517Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:54.9887645Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9887776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9887938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9888543Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9889170Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9889774Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9890394Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9890996Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9891598Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9892212Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9892816Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9893439Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9894041Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9894181Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:54.9894221Z Autotune Choices Stats: 2025-12-04T09:58:54.9894979Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:54.9895197Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9895362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9895638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9896334Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9896963Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9897613Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9898237Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9898876Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9899507Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9900132Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9900775Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9901409Z 
triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9902056Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9902184Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:54.9902259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9902302Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9902340Z unimplemented [] 2025-12-04T09:58:54.9902400Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9902511Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9903078Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9903116Z graph_break [] 2025-12-04T09:58:54.9903190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9903230Z Autotune Choices Stats: 2025-12-04T09:58:54.9903978Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:54.9904105Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9904218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9904377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9905001Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9905621Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9906257Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9906859Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9910460Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9911063Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9911672Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9912294Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9912899Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9913528Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9913656Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:54.9913696Z Autotune Choices Stats: 2025-12-04T09:58:54.9914459Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9914684Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9914849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9915126Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9915763Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9916440Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9917066Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9917712Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9918339Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9918983Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9919606Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9920234Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9920871Z 
triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9921495Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9921644Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:54.9921721Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9921766Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9921803Z unimplemented [] 2025-12-04T09:58:54.9921863Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9921964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9922535Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9922585Z graph_break [] 2025-12-04T09:58:54.9922660Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9922699Z Autotune Choices Stats: 2025-12-04T09:58:54.9923523Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:54.9923652Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9923765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9923925Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9924538Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9925152Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9925769Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9926410Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9927015Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9927631Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9928232Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9928834Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9929450Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9930073Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9930261Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:54.9930306Z Autotune Choices Stats: 2025-12-04T09:58:54.9931062Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9931292Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9931458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9931737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9932372Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9933001Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9933638Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9934271Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9934915Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9935542Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9936216Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9936838Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9937461Z 
triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9938102Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9938243Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:54.9938318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9938362Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9938398Z unimplemented [] 2025-12-04T09:58:54.9938458Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9938570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9939143Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9939180Z graph_break [] 2025-12-04T09:58:54.9939256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9939297Z Autotune Choices Stats: 2025-12-04T09:58:54.9940043Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:54.9940184Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9940298Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9940463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9941076Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9941682Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9942295Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9942915Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9943518Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9944123Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9944740Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9945344Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9945968Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9946586Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9946733Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:54.9946774Z Autotune Choices Stats: 2025-12-04T09:58:54.9947544Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:54.9947764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9947928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9948223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9948857Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9949485Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9950106Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9950742Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9951394Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9952020Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9952638Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9953278Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9953903Z 
triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9954529Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9954661Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:54.9954745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9954790Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9954828Z unimplemented [] 2025-12-04T09:58:54.9954890Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9955002Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9955592Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:54.9955631Z graph_break [] 2025-12-04T09:58:54.9955707Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9955750Z Autotune Choices Stats: 2025-12-04T09:58:54.9956522Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:54.9956669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9956785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9956946Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9957553Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9958156Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9958763Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9959382Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:54.9960011Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9960608Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9961216Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9961830Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9962440Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9963037Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9963175Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:54.9963216Z Autotune Choices Stats: 2025-12-04T09:58:54.9963976Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:54.9964217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9964384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9964665Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9965300Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9965968Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9966593Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9967216Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9967861Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9968506Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9969136Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9969767Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9970403Z 
triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9971033Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9971164Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:54.9971238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9971281Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9971317Z unimplemented [] 2025-12-04T09:58:54.9971380Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9971481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9972065Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9972113Z graph_break [] 2025-12-04T09:58:54.9972188Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9972239Z Autotune Choices Stats: 2025-12-04T09:58:54.9972986Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:54.9973118Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9973233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9973394Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9974015Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9974613Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9975218Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9975832Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9976470Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9977099Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9977706Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9978319Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9978919Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9979518Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9979646Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:54.9979690Z Autotune Choices Stats: 2025-12-04T09:58:54.9980469Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:54.9980698Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9980863Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9981149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9981779Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9982408Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9983052Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9983679Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9984318Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9984958Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9985597Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9986262Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9986906Z 
triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9987529Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9987659Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:54.9987735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:54.9987776Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:54.9987817Z unimplemented [] 2025-12-04T09:58:54.9987878Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:54.9987978Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:54.9988550Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:54.9988589Z graph_break [] 2025-12-04T09:58:54.9988674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:54.9988716Z Autotune Choices Stats: 2025-12-04T09:58:54.9989461Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:54.9989610Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9989725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9989885Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9990494Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9991119Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9991722Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9992330Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:54.9992946Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9993551Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:54.9994173Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9994778Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9995393Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9996040Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9996169Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:54.9996211Z Autotune Choices Stats: 2025-12-04T09:58:54.9996968Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:54.9997202Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:54.9997367Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:54.9997665Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:54.9998322Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9998949Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:54.9999589Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0000218Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0000846Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0001487Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0002114Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0002763Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0003389Z 
triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0004026Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0004157Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.0004233Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0004276Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0004317Z unimplemented [] 2025-12-04T09:58:55.0004378Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0004481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0005058Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0005099Z graph_break [] 2025-12-04T09:58:55.0005175Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0005217Z Autotune Choices Stats: 2025-12-04T09:58:55.0006016Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 2025-12-04T09:58:55.0006154Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0006270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0006430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0007056Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0007662Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0008277Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0008878Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0009485Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0010098Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0010699Z triton_flex_attention_1540 0.0131 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0011325Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0011926Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0012536Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0012664Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.0012705Z Autotune Choices Stats: 2025-12-04T09:58:55.0013468Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0013684Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0013849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0014137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0014769Z triton_flex_attention_backward_1559 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0015424Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0016088Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0016728Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0017354Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0017986Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0018628Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0019250Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0019908Z 
triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0020533Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0020672Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 seconds precompiling for 22 choices 2025-12-04T09:58:55.0020749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0020792Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0020832Z unimplemented [] 2025-12-04T09:58:55.0020893Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0020994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0021570Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0021609Z graph_break [] 2025-12-04T09:58:55.0021686Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0021726Z Autotune Choices Stats: 2025-12-04T09:58:55.0022466Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.0022596Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0022719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0022880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0023508Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0024110Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0024717Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0025326Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0025970Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0026575Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0027201Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0027812Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0028423Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0029027Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0029171Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.0029212Z Autotune Choices Stats: 2025-12-04T09:58:55.0029976Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.0030193Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0030359Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0030635Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0031288Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0031911Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0032555Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0033179Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0033819Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0034445Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0035067Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0035709Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0036397Z 
triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0037021Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0037150Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.0037224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0037268Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0037318Z unimplemented [] 2025-12-04T09:58:55.0037382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0037481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0038064Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0038101Z graph_break [] 2025-12-04T09:58:55.0038179Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0038219Z Autotune Choices Stats: 2025-12-04T09:58:55.0038959Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.0039089Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0039203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0039366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0039995Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0040614Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0041217Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0041820Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0042436Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0043037Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0043644Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0044260Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0044881Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0045483Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0045614Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.0045653Z Autotune Choices Stats: 2025-12-04T09:58:55.0046446Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.0046688Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0046853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0047134Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0047760Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0048413Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0049064Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0049689Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0050318Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0050958Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0051588Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0052220Z triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0052858Z 
triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0053504Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0053634Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.0053707Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0053752Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0053791Z unimplemented [] 2025-12-04T09:58:55.0053852Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0053951Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0054523Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0054571Z graph_break [] 2025-12-04T09:58:55.0054645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0054685Z Autotune Choices Stats: 2025-12-04T09:58:55.0055427Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.0055560Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0055675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0055836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0056490Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0057108Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0057736Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0058342Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0058944Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0059558Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0060170Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0060779Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0061391Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0062021Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0062154Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.0062197Z Autotune Choices Stats: 2025-12-04T09:58:55.0062956Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.0063185Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0063349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0063626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0064259Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0064894Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0065532Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0066219Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0066853Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0067476Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0068123Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0068749Z triton_flex_attention_backward_1701 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0069381Z 
triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0070022Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0070164Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.0070239Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0070284Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0070321Z unimplemented [] 2025-12-04T09:58:55.0070395Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0070493Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0071067Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0071107Z graph_break [] 2025-12-04T09:58:55.0071181Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0071222Z Autotune Choices Stats: 2025-12-04T09:58:55.0071977Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.0072109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0072224Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0072385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0073001Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0073608Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0074223Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0074840Z triton_flex_attention_1705 0.0130 ms 80.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0075446Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0076089Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0076711Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0077317Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0077921Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0078544Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0078691Z SingleProcess AUTOTUNE benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.0078733Z Autotune Choices Stats: 2025-12-04T09:58:55.0079510Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.0079732Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0079897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0080188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0080820Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0081451Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0082076Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0082717Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0083366Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0083995Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0084627Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0085263Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0085892Z 
triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0086542Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0086697Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.0086792Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.0086841Z Traceback (most recent call last): 2025-12-04T09:58:55.0087008Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.0087048Z self.assertTrue( 2025-12-04T09:58:55.0087152Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.0087201Z raise self.failureException(msg) 2025-12-04T09:58:55.0087332Z AssertionError: False is not true : Log file /tmp/tmpuv32uu08/flex_attention_configs.json was not created 2025-12-04T09:58:55.0087336Z 
2025-12-04T09:58:55.0087426Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.0087593Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.0087596Z 2025-12-04T09:58:55.0087686Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.0087764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0087807Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0087848Z unimplemented [] 2025-12-04T09:58:55.0087909Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0088483Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.0088597Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0088635Z graph_break [] 2025-12-04T09:58:55.0088712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0089205Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.0089256Z current_size = base.storage().size() 2025-12-04T09:58:55.0089298Z Autotune Choices Stats: 2025-12-04T09:58:55.0090051Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.0090180Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0090295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0090457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0091076Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0091705Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0092307Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0092909Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0093523Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0094126Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0094798Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0095417Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0096073Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0096673Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0096804Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.0096848Z Autotune Choices Stats: 2025-12-04T09:58:55.0097616Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.0097833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0098001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0098280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0098908Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0099545Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0100192Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0100812Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0101439Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0102071Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0102699Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0103321Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0103954Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0104597Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0104734Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.0104810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0104857Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0104896Z unimplemented [] 2025-12-04T09:58:55.0104961Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0105061Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0105639Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0105685Z graph_break [] 2025-12-04T09:58:55.0105759Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.0105799Z Autotune Choices Stats: 2025-12-04T09:58:55.0106561Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.0106693Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0106807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0106972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0107579Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0108205Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0108831Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0110519Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0111131Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0111752Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0112357Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0112961Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0113563Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0114188Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0114317Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.0114361Z Autotune Choices Stats: 2025-12-04T09:58:55.0115166Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.0115395Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0115559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0115834Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0116534Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0117169Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0117793Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0118445Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0119068Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0119716Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0120350Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0120977Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0121605Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0122230Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0122373Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.0122449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0122495Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0122532Z unimplemented [] 2025-12-04T09:58:55.0122604Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0122705Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0123278Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0123317Z graph_break [] 2025-12-04T09:58:55.0123408Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0123450Z Autotune Choices Stats: 2025-12-04T09:58:55.0124200Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.0124329Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0124444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0124608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0125217Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0125819Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0126453Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0127101Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0127721Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.0128329Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0128947Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0129545Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0130144Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0130752Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0130898Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.0130940Z Autotune Choices Stats: 2025-12-04T09:58:55.0131713Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.0131930Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.0132110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0132395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0133027Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0133652Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0134276Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0134905Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0135550Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0136212Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0136857Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0137500Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0138123Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0138745Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0138875Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.0138950Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0138998Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0139050Z unimplemented [] 2025-12-04T09:58:55.0139112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0139212Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0139796Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0139834Z graph_break [] 2025-12-04T09:58:55.0139909Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0139948Z Autotune Choices Stats: 2025-12-04T09:58:55.0140696Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.0140840Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0140953Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0141115Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0141731Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0142334Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0142937Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0143537Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0144167Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0144783Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0145393Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0146039Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0146636Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0147242Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0147374Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.0147414Z Autotune Choices Stats: 2025-12-04T09:58:55.0148196Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.0148427Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0148590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0148865Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0149504Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0150155Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0150780Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0151403Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0152028Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0152679Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0153310Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0153944Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0154567Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0155197Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0155330Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.0155405Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0155449Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0155488Z unimplemented [] 2025-12-04T09:58:55.0155553Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0155653Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0156274Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0156328Z graph_break [] 2025-12-04T09:58:55.0156403Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0156446Z Autotune Choices Stats: 2025-12-04T09:58:55.0157196Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.0157327Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0157456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0157637Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0158247Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0158851Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.0159460Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0160065Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0160666Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0161285Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0161905Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0162519Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0163121Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0163726Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0163857Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.0163899Z Autotune Choices Stats: 2025-12-04T09:58:55.0164654Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.0164881Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0165055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0165332Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0166003Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0166643Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0167268Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0167891Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0168522Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0169146Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0169803Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0170446Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0171085Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0171713Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0171845Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.0171918Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0171962Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0172000Z unimplemented [] 2025-12-04T09:58:55.0172066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0172164Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0172742Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0172783Z graph_break [] 2025-12-04T09:58:55.0172857Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0172898Z Autotune Choices Stats: 2025-12-04T09:58:55.0173649Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.0173788Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0173905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0174066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0174697Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0175311Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0175915Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0176545Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0177152Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0177755Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0178382Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0178996Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0179615Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0180215Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0180345Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.0180387Z Autotune Choices Stats: 2025-12-04T09:58:55.0181142Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.0181360Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0181526Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0181818Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0182461Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0183103Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0183733Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0184368Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0185000Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0185628Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0186291Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0186956Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0187598Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0188228Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0188358Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.0188433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0188475Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0188517Z unimplemented [] 2025-12-04T09:58:55.0188578Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0188683Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0189258Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0189297Z graph_break [] 2025-12-04T09:58:55.0189371Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0189413Z Autotune Choices Stats: 2025-12-04T09:58:55.0190160Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.0190298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0190416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0190594Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0191205Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0191823Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0192429Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0193035Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0193648Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0194257Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0194857Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0195483Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0196134Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0196748Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0196876Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.0196918Z Autotune Choices Stats: 2025-12-04T09:58:55.0197679Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.0197900Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0198068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0198345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0198976Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0199626Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0200267Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0200907Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0201536Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0202167Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0202788Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0203416Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0204062Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0204707Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0204847Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.0204925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0204968Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0205008Z unimplemented [] 2025-12-04T09:58:55.0205069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0205170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0205744Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0205785Z graph_break [] 2025-12-04T09:58:55.0205858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0205901Z Autotune Choices Stats: 2025-12-04T09:58:55.0206684Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.0206812Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0206928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0207089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0207727Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0208330Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0208946Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0209563Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0210167Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0210768Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0211371Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0211981Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0212589Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0213206Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0213345Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.0213389Z Autotune Choices Stats: 2025-12-04T09:58:55.0214157Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0214375Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0214540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0214817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.0215452Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0216107Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0216760Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0217400Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0218040Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0218669Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0219292Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0219927Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0220581Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0221206Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0221334Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.0221422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0221465Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0221527Z unimplemented [] 2025-12-04T09:58:55.0221590Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0221692Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0222269Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0222308Z graph_break [] 2025-12-04T09:58:55.0222384Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0222426Z Autotune Choices Stats: 2025-12-04T09:58:55.0223169Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.0223298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0223416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0223582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0224192Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0224817Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0225422Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0226187Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0226806Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0227415Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0228024Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0228629Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0229268Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0229872Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0230002Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.0230045Z Autotune Choices Stats: 2025-12-04T09:58:55.0230823Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.0231051Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0231223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0231503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0232132Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0232761Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0233397Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0234035Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0234677Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0235319Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0236005Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0236628Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0237262Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0237925Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0238056Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices
2025-12-04T09:58:55.0238130Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:55.0238174Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:55.0238213Z unimplemented []
2025-12-04T09:58:55.0238280Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:55.0238380Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:55.0238980Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)]
2025-12-04T09:58:55.0239031Z graph_break []
2025-12-04T09:58:55.0239110Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:55.0239151Z Autotune Choices Stats:
2025-12-04T09:58:55.0239893Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0}
2025-12-04T09:58:55.0240023Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1,
1x1x1x1) 2025-12-04T09:58:55.0240136Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0240299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0240911Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0241514Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0242136Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0242744Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0243358Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0243975Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0244581Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0245184Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0245788Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0246456Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0246590Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.0246630Z Autotune Choices Stats: 2025-12-04T09:58:55.0247408Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.0247641Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0247808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0248088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0248719Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0249351Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0249976Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0250621Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0251252Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0251897Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0252532Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0253159Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0253791Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0254422Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0254559Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.0254633Z ----------------------------- Captured stdout call 
-----------------------------
2025-12-04T09:58:55.0254681Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:55.0254718Z unimplemented []
2025-12-04T09:58:55.0254779Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:55.0254895Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:55.0255472Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)]
2025-12-04T09:58:55.0255510Z graph_break []
2025-12-04T09:58:55.0255588Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:55.0255640Z Autotune Choices Stats:
2025-12-04T09:58:55.0256428Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0}
2025-12-04T09:58:55.0256570Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1)
2025-12-04T09:58:55.0256687Z strides: [16384, 8192, 64,
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0256847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0257463Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0258074Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0258676Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0259306Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0259906Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0260526Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0261144Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0261746Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0262349Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0262959Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0263100Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.0263139Z Autotune Choices Stats: 2025-12-04T09:58:55.0263912Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.0264133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0264308Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0264598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0265235Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0265862Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0266525Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0267155Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0267806Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0268433Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0269068Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0269711Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0270338Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0270965Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0271097Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.0271171Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0271217Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.0271254Z unimplemented [] 2025-12-04T09:58:55.0271334Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0271435Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0272022Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0272062Z graph_break [] 2025-12-04T09:58:55.0272137Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0272180Z Autotune Choices Stats: 2025-12-04T09:58:55.0272932Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.0273077Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0273191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0273354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0273973Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0274575Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0275179Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0275783Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0276447Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0277060Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0277676Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0278287Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0278891Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0279492Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0279623Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.0279664Z Autotune Choices Stats: 2025-12-04T09:58:55.0280426Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.0280668Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0280834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0281113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0281759Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0282394Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0283023Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0283651Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0284278Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0284924Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0285551Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0286209Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0286849Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0287474Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0287605Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.0287680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0287725Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0287763Z unimplemented [] 
2025-12-04T09:58:55.0287824Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0287923Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0288495Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0288545Z graph_break [] 2025-12-04T09:58:55.0288619Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0288662Z Autotune Choices Stats: 2025-12-04T09:58:55.0289414Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.0289544Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0289679Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0289837Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0290458Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0291064Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0291672Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0292275Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0292876Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0293501Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0294115Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0294727Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0295329Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0295964Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0296095Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.0296136Z Autotune Choices Stats: 2025-12-04T09:58:55.0296895Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.0297135Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0297299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0297600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0298259Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0298876Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0299511Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0300135Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0300767Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0301391Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0302039Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0302691Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0303333Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0303958Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0304087Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.0304162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0304204Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0304243Z unimplemented [] 2025-12-04T09:58:55.0304304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.0304403Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0304981Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0305019Z graph_break [] 2025-12-04T09:58:55.0305092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0305134Z Autotune Choices Stats: 2025-12-04T09:58:55.0305893Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.0306068Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0306183Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0306343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0306979Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0307594Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0308199Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0308801Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0311625Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0312234Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0312877Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0313490Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0314105Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0314707Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0314838Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.0314879Z Autotune Choices Stats: 
2025-12-04T09:58:55.0315639Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.0315861Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0316083Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0316373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0317025Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0317666Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0318305Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0318924Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0319551Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0320177Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0320800Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0321449Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0322090Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0322722Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0322852Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.0322930Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0322974Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0323013Z unimplemented [] 2025-12-04T09:58:55.0323074Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0323177Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0323753Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0323793Z graph_break [] 2025-12-04T09:58:55.0323869Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0323910Z Autotune Choices Stats: 2025-12-04T09:58:55.0324651Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.0324789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0324909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0325070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.0325686Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0326340Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0326955Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0327561Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0328171Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0328773Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0329376Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0330007Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0330611Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0331219Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0331348Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.0331389Z Autotune Choices Stats: 2025-12-04T09:58:55.0332152Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.0332371Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0332539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0332816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0333442Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0334089Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0334715Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0335348Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0336017Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0336639Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0337259Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0337886Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0338543Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0339177Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0339319Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.0339395Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0339438Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0339476Z unimplemented [] 2025-12-04T09:58:55.0339538Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0339640Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0340216Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0340260Z graph_break [] 2025-12-04T09:58:55.0340337Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0340376Z Autotune Choices Stats: 2025-12-04T09:58:55.0341119Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.0341248Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0341363Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0341523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0342154Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0342757Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0343370Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0343976Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0344578Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0345184Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0345780Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0346403Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0347040Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0347660Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0347801Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.0347841Z Autotune Choices Stats: 2025-12-04T09:58:55.0348591Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.0348809Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0348976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0349254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0349886Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0350505Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0351151Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0351787Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0352432Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0353059Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0353684Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0354311Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0354937Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0355584Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0355718Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.0355793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0355847Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0355885Z unimplemented [] 2025-12-04T09:58:55.0355994Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0356095Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0356679Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0356717Z graph_break [] 2025-12-04T09:58:55.0356793Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0356832Z Autotune Choices Stats: 2025-12-04T09:58:55.0357572Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.0357701Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0357814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0357975Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0358591Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0359224Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0359825Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0360442Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0361061Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0361665Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0362264Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0362868Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0363491Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0364091Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0364223Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.0364262Z Autotune Choices Stats: 2025-12-04T09:58:55.0365038Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.0365264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0365430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0365709Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0366358Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0366980Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0367604Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0368255Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0368906Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0369548Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0370177Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0370794Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0371419Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0372069Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0372197Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.0372270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0372313Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0372349Z unimplemented [] 2025-12-04T09:58:55.0372410Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0372511Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0373096Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0373143Z graph_break [] 2025-12-04T09:58:55.0373218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0373257Z Autotune Choices Stats: 2025-12-04T09:58:55.0374002Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.0374130Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0374244Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0374404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0375016Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0375620Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0376275Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0376876Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0377487Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0378101Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0378708Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0379309Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0379908Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0380536Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0380664Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.0380705Z Autotune Choices Stats: 2025-12-04T09:58:55.0381475Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.0381703Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0381867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0382140Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0382776Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0383401Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0384026Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0384671Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0385301Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0385977Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0386614Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0387246Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0387867Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0388494Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0388634Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.0388708Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0388753Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0388790Z unimplemented [] 2025-12-04T09:58:55.0388852Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0388965Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0389542Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0389581Z graph_break [] 2025-12-04T09:58:55.0389657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0389698Z Autotune Choices Stats: 2025-12-04T09:58:55.0390453Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.0390595Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0390708Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0390869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0391482Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0392085Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0392689Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0393309Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.0393917Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0394531Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0395143Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0395745Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0396398Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0396997Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0397140Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.0397180Z Autotune Choices Stats: 2025-12-04T09:58:55.0397951Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.0398169Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0398349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0398635Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0399271Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0399893Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0400516Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0401142Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0401795Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0402417Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0403046Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0403685Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0404309Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0404930Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0405061Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.0405135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0405178Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0405215Z unimplemented [] 2025-12-04T09:58:55.0405276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0405391Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0406020Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0406059Z graph_break [] 2025-12-04T09:58:55.0406132Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0406173Z Autotune Choices Stats: 2025-12-04T09:58:55.0406916Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.0407056Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0407169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0407328Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0407943Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0408544Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0409148Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0409753Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0410382Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0410995Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0411598Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0412212Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0412818Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0413419Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0413548Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.0413588Z Autotune Choices Stats: 2025-12-04T09:58:55.0414347Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.0414586Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0414751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0415028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0415667Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0416341Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0416970Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0417592Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0418219Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0418864Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0419487Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0420126Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0420762Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0421393Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0421521Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.0421596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0421638Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0421675Z unimplemented [] 2025-12-04T09:58:55.0421736Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0421837Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0422409Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0422458Z graph_break [] 2025-12-04T09:58:55.0422530Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0422570Z Autotune Choices Stats: 2025-12-04T09:58:55.0423322Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.0423452Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0423565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0423734Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0424355Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0424957Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0425562Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0426201Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0426801Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0427426Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0428042Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0428655Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0429255Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0429856Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0429985Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.0430026Z Autotune Choices Stats: 2025-12-04T09:58:55.0430785Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.0431014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0431177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0431464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0432095Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0432739Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0433368Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0433990Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0434623Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0435246Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0435894Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0436574Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0437212Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0437835Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0437963Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.0438043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0438085Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0438123Z unimplemented [] 2025-12-04T09:58:55.0438182Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0438283Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0438861Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0438902Z graph_break [] 2025-12-04T09:58:55.0438976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0439016Z Autotune Choices Stats: 2025-12-04T09:58:55.0439757Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.0439913Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0440029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0440193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0440814Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0441432Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0442041Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0442644Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0443246Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0443849Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0444469Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0445084Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0445694Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0446335Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0446462Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.0446503Z Autotune Choices Stats: 2025-12-04T09:58:55.0447265Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.0447484Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0447649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0447946Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0448600Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0449234Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0449867Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0450493Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0451125Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0451748Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0452372Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0453018Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0453658Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0454293Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0454422Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.0454496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0454539Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0454580Z unimplemented [] 2025-12-04T09:58:55.0454644Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0454748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0455324Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0455361Z graph_break [] 
2025-12-04T09:58:55.0455434Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0455473Z Autotune Choices Stats: 2025-12-04T09:58:55.0456248Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.0456389Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0456505Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0456670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0457303Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0457924Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0458536Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0459140Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0459743Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0460346Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0460952Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0461580Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0462194Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0462807Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0462935Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.0462975Z Autotune Choices Stats: 2025-12-04T09:58:55.0463741Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.0463958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0464126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0464403Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0465033Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0465689Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.0466365Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0467000Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0467627Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0468259Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0468884Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0469510Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0470167Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0470804Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0470953Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.0471027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0471072Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0471110Z unimplemented [] 2025-12-04T09:58:55.0471172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0471273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0471849Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0471886Z graph_break [] 2025-12-04T09:58:55.0471960Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.0472001Z Autotune Choices Stats: 2025-12-04T09:58:55.0472743Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.0472872Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0472986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0473148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0473769Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0474371Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0474988Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0475600Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0476241Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0476847Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0477448Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0478052Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0478680Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0479303Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0479452Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.0479491Z Autotune Choices Stats: 2025-12-04T09:58:55.0480248Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.0480465Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0480631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0480912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0481548Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0482172Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0482817Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0483453Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0484089Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0484718Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0485342Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0486018Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0486641Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0487298Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0487427Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.0487502Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0487561Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0487599Z unimplemented [] 2025-12-04T09:58:55.0487675Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0487776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0488352Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0488390Z graph_break [] 2025-12-04T09:58:55.0488466Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0488506Z Autotune Choices Stats: 
2025-12-04T09:58:55.0489246Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.0489373Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0489487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0489649Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0490263Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0490890Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0491494Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0492110Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0492722Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0493325Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0493928Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0494534Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0495162Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0495763Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0495893Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.0495959Z Autotune Choices Stats: 2025-12-04T09:58:55.0496752Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.0496984Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0497150Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0497427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0498061Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0498688Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0499315Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0499967Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0500599Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0501239Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0501865Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0502495Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0503122Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0503776Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0503908Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.0503983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0504029Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0504067Z unimplemented [] 2025-12-04T09:58:55.0504129Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0504228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0504821Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0504868Z graph_break [] 2025-12-04T09:58:55.0504941Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0504982Z Autotune Choices Stats: 2025-12-04T09:58:55.0505728Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.0505858Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0506006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0506168Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0506780Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0507385Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0508017Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0508624Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0509245Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0509861Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0510469Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0511075Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0511673Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0512307Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0512440Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.0512481Z Autotune Choices Stats: 2025-12-04T09:58:55.0513249Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.0513477Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.0513642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0513921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0514556Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0515182Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0515805Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0516498Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0517132Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0517771Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0518406Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0519038Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0519671Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0520297Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0520442Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.0520521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0520564Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0520605Z unimplemented [] 2025-12-04T09:58:55.0520667Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0520778Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0521353Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0521394Z graph_break [] 2025-12-04T09:58:55.0521469Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0521519Z Autotune Choices Stats: 2025-12-04T09:58:55.0522253Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.0522395Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0522509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0522667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0523277Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0523892Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0524498Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0525126Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0525734Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0526408Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0527026Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0527634Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0528241Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0528846Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0529006Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.0529047Z Autotune Choices Stats: 2025-12-04T09:58:55.0529823Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.0530040Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0530221Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0530509Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0531137Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0531763Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0532390Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0533018Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0533668Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0534296Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0534935Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0535574Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0536246Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0536877Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0537007Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.0537085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0537126Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0537164Z unimplemented [] 2025-12-04T09:58:55.0537239Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0537342Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0537929Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0537970Z graph_break [] 2025-12-04T09:58:55.0538042Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0538087Z Autotune Choices Stats: 2025-12-04T09:58:55.0538843Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.0538985Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0539103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0539263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0539875Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0540480Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0541088Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0541700Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0542324Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0542936Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0543561Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0544171Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0544770Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0545377Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0545505Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.0545547Z Autotune Choices Stats: 2025-12-04T09:58:55.0546370Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0546602Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0546769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0547047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0547701Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0548343Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0548968Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0549595Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0550224Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0550880Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0551512Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0552151Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0552785Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0553409Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0553538Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.0553616Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0553659Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0553699Z unimplemented [] 2025-12-04T09:58:55.0553761Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0553862Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0554435Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0554489Z graph_break [] 2025-12-04T09:58:55.0554564Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0554606Z Autotune Choices Stats: 2025-12-04T09:58:55.0555364Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.0555491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0555619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0555792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0556454Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0557061Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.0557666Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0558273Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0558877Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0559522Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0560148Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0560764Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0561367Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0561969Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0562097Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.0562141Z Autotune Choices Stats: 2025-12-04T09:58:55.0562904Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0563130Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0563308Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0563586Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0564235Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0564872Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0565495Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0566205Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0566836Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0567466Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0568128Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0568768Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0569412Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0570038Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0570169Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.0570244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0570289Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0570327Z unimplemented [] 2025-12-04T09:58:55.0570387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0570488Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0571063Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0571100Z graph_break [] 2025-12-04T09:58:55.0571176Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0571216Z Autotune Choices Stats: 2025-12-04T09:58:55.0571968Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.0572110Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0572224Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0572385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0573010Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0573625Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0574231Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0574837Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0575448Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0576099Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0576735Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0577356Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0577973Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0578579Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0578710Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.0578751Z Autotune Choices Stats: 2025-12-04T09:58:55.0579518Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.0579737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0579903Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0580194Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0580838Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0581478Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0582120Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0582754Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0583386Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0584021Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0584652Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0585302Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0585983Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0586625Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0586757Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.0586833Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0586881Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0586919Z unimplemented [] 2025-12-04T09:58:55.0586982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0587081Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0587655Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0587692Z graph_break [] 2025-12-04T09:58:55.0587769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0587810Z Autotune Choices Stats: 2025-12-04T09:58:55.0588561Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.0588702Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0588817Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0588996Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0589602Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0590218Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0590835Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0591438Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0592038Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0592643Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0593270Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0593885Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0594502Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0595117Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0595250Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.0595290Z Autotune Choices Stats: 2025-12-04T09:58:55.0596091Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0596312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0596478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0596757Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0597393Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0598046Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0598689Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0599324Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0599958Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0600587Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0601212Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0601856Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0602490Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0603126Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0603265Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.0603338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0603382Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0603420Z unimplemented [] 2025-12-04T09:58:55.0603481Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0603580Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0604159Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0604197Z graph_break [] 2025-12-04T09:58:55.0604270Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0604312Z Autotune Choices Stats: 2025-12-04T09:58:55.0605054Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.0605181Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0605293Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0605466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0606151Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0606751Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0607370Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0607991Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0608603Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0609206Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0609814Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0610435Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0611043Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0611655Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0611796Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.0611837Z Autotune Choices Stats: 2025-12-04T09:58:55.0612592Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.0612817Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0612991Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0613275Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0613916Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0614564Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0615191Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0615833Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0616513Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0617143Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0617774Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0618404Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0619062Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0619688Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0619819Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.0619913Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0619967Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0620012Z unimplemented [] 2025-12-04T09:58:55.0620072Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0620173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0620748Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0620788Z graph_break [] 2025-12-04T09:58:55.0620863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0620906Z Autotune Choices Stats: 2025-12-04T09:58:55.0621646Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.0621773Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0621889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0622051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0622664Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0623289Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0623890Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0624512Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0625128Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0625731Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0626374Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0626974Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0627606Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0628205Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0628347Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.0628388Z Autotune Choices Stats: 2025-12-04T09:58:55.0629159Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.0629377Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0629542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0629823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.0630452Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0631079Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0631730Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0632355Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0632996Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0633637Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0634265Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0634894Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0635525Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0636230Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0636360Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.0636435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0636479Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0636519Z unimplemented [] 2025-12-04T09:58:55.0636581Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0636681Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0637272Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0637324Z graph_break [] 2025-12-04T09:58:55.0637398Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0637442Z Autotune Choices Stats: 2025-12-04T09:58:55.0638189Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.0638316Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0638433Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0638594Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0639203Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0639805Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0640444Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0641057Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0641670Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0642275Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0642882Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0643489Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0644093Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0644726Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0644856Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.0644899Z Autotune Choices Stats: 2025-12-04T09:58:55.0645674Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0645904Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0646111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0646392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0647029Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0647659Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0648285Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0648938Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0649570Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0650220Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0650860Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0651488Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0652122Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0652749Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0652889Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.0652966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0653010Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0653061Z unimplemented [] 2025-12-04T09:58:55.0653123Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0653228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0653805Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0653845Z graph_break [] 2025-12-04T09:58:55.0653933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0653984Z Autotune Choices Stats: 2025-12-04T09:58:55.0654722Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.0654850Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0654967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0655129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0655742Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0656405Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0657010Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0657651Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0658274Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0658895Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0659501Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0660101Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0660709Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0661307Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0661447Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.0661491Z Autotune Choices Stats: 2025-12-04T09:58:55.0662265Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.0662501Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0662673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0662960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0663590Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0664214Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0664846Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0665474Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0666162Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0666799Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0667443Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0668075Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0668704Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0669335Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0669466Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.0669543Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0669601Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0669642Z unimplemented [] 2025-12-04T09:58:55.0669701Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0669803Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0670391Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0670429Z graph_break [] 2025-12-04T09:58:55.0670507Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0670548Z Autotune Choices Stats: 2025-12-04T09:58:55.0671299Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.0671442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0671556Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0671722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0672334Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0672940Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0673544Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0674150Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0674782Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0675397Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0676046Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0676651Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0677256Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0677867Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0677999Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.0678041Z Autotune Choices Stats: 2025-12-04T09:58:55.0678844Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.0679061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0679228Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0679528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0680178Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0680815Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0681443Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0682068Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0682699Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0683357Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0683987Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0684624Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0685257Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0685882Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0686045Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.0686121Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.0686166Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0686204Z unimplemented [] 2025-12-04T09:58:55.0686269Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0686368Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0686945Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0687004Z graph_break [] 2025-12-04T09:58:55.0687079Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0687121Z Autotune Choices Stats: 2025-12-04T09:58:55.0687876Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.0688024Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0688136Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0688311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0688915Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0689523Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0690130Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0690737Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0691351Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0691970Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0692587Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0693202Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0693803Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0694409Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0694542Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.0694585Z Autotune Choices Stats: 2025-12-04T09:58:55.0695350Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.0695579Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0695754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0696075Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0696729Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0697379Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0698008Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0698639Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0699270Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0699899Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0700550Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0701196Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0701832Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0702462Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0702594Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.0702668Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0702716Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.0702755Z unimplemented [] 2025-12-04T09:58:55.0702817Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0702918Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0703497Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0703538Z graph_break [] 2025-12-04T09:58:55.0703611Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0703665Z Autotune Choices Stats: 2025-12-04T09:58:55.0704423Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.0704553Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0704669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0704833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0705463Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0706116Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0706724Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0707328Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0707941Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0708585Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0709185Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0709808Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0710423Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0711030Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0711163Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.0711205Z Autotune Choices Stats: 2025-12-04T09:58:55.0711966Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.0712188Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0712352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0712642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0713289Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0713923Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0714558Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0715190Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0715820Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0716604Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0717256Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0717886Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0718526Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0719157Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0719291Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.0719367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0719414Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0719452Z unimplemented [] 
2025-12-04T09:58:55.0719517Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0719617Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0720195Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0720237Z graph_break [] 2025-12-04T09:58:55.0720313Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0720357Z Autotune Choices Stats: 2025-12-04T09:58:55.0721106Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.0721252Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0721375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0721537Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0722146Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0722761Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0723379Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0723984Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0724593Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0725198Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0725831Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0726478Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0727099Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0727708Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0727842Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.0727886Z Autotune Choices Stats: 2025-12-04T09:58:55.0728649Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.0728871Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0729040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0729322Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0729984Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0730610Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0731245Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0731873Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0732509Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0733136Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0733762Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0734417Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0735045Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0735680Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0735826Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.0735971Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.0736021Z Traceback (most recent call last): 2025-12-04T09:58:55.0736175Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in 
test_flex_attention_logging 2025-12-04T09:58:55.0736216Z self.assertTrue( 2025-12-04T09:58:55.0736327Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.0736378Z raise self.failureException(msg) 2025-12-04T09:58:55.0736509Z AssertionError: False is not true : Log file /tmp/tmp8s4y4nc_/flex_attention_configs.json was not created 2025-12-04T09:58:55.0736512Z 2025-12-04T09:58:55.0736588Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.0736758Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.0736761Z 2025-12-04T09:58:55.0736852Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.0736931Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0736974Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0737015Z unimplemented [] 2025-12-04T09:58:55.0737077Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0737657Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.0737760Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0737822Z graph_break [] 2025-12-04T09:58:55.0737900Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0738384Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.0738448Z current_size = base.storage().size() 2025-12-04T09:58:55.0738490Z Autotune Choices Stats: 2025-12-04T09:58:55.0739233Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.0739375Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0739507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0739670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0740278Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0740886Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0741492Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0742094Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0742718Z 
triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0743318Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0743933Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0744542Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0745144Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0745752Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0745888Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.0745971Z Autotune Choices Stats: 2025-12-04T09:58:55.0746732Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.0746966Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0747146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0747424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0748072Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0748709Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0749333Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0749954Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0750591Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0751218Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0751862Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0752508Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0753159Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0753781Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0753913Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.0753990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0754035Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0754073Z unimplemented [] 2025-12-04T09:58:55.0754134Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0754235Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0754819Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), 
('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0754859Z graph_break [] 2025-12-04T09:58:55.0754937Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0754989Z Autotune Choices Stats: 2025-12-04T09:58:55.0755740Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.0755871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0756024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0756187Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0756826Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0757445Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0758049Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0758659Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0759267Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0759896Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0760496Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0761112Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0761724Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0762331Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0762466Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.0762509Z Autotune Choices Stats: 2025-12-04T09:58:55.0763262Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.0763480Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0763661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0763940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0764585Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0765223Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0765856Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0766521Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0767146Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0767776Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0768443Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0769073Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0769708Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0770350Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0770484Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.0770560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0770607Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0770645Z unimplemented [] 2025-12-04T09:58:55.0770709Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0770809Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0771385Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0771424Z graph_break [] 2025-12-04T09:58:55.0771505Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0771546Z Autotune Choices Stats: 2025-12-04T09:58:55.0772289Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.0772433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0772558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0772723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0773334Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0773953Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0774568Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0775174Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0775779Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0776415Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0777053Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0777652Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0778265Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0778881Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0779014Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.0779056Z Autotune Choices Stats: 2025-12-04T09:58:55.0779827Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.0780045Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0780210Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0780487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0781139Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0781766Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0782402Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0783036Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0783666Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0784288Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0784914Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0785570Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0786231Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0786873Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0787015Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.0787091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0787138Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0787176Z unimplemented [] 2025-12-04T09:58:55.0787237Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0787337Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0787917Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0787961Z graph_break [] 
2025-12-04T09:58:55.0788038Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0788080Z Autotune Choices Stats: 2025-12-04T09:58:55.0788827Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.0788957Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0789072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0789248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0789873Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0790479Z triton_flex_attention_142 0.0110 ms 82.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0791094Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0791706Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0792313Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0792920Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0793520Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0794142Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0794746Z triton_flex_attention_150 
0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0795361Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0795502Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.0795544Z Autotune Choices Stats: 2025-12-04T09:58:55.0796363Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 
2025-12-04T09:58:55.0796583Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0796748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0797035Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0797670Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0798323Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0798949Z 
triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0799593Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0800243Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0800875Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0801505Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0802134Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0802785Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0803409Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0803551Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.0803641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0803685Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0803727Z unimplemented [] 2025-12-04T09:58:55.0803788Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0803888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0804462Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0804505Z graph_break [] 2025-12-04T09:58:55.0804581Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:55.0804624Z Autotune Choices Stats: 2025-12-04T09:58:55.0805365Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.0805497Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0805631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0805795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0806445Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0807097Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0807716Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0808336Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0808941Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0809535Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0810148Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0810752Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0811373Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0811975Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0812118Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.0812174Z Autotune Choices Stats: 2025-12-04T09:58:55.0812932Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.0813156Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0813326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0813603Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0814242Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0814866Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0815509Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0816173Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0816841Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0817482Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0818109Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0818732Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0819361Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0820030Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0822127Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.0822208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0822254Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0822295Z unimplemented [] 2025-12-04T09:58:55.0822356Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0822458Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0823060Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0823109Z graph_break [] 2025-12-04T09:58:55.0823186Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0823230Z Autotune Choices Stats: 2025-12-04T09:58:55.0823973Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.0824100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0824219Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0824384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0824997Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0825606Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0826271Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0826887Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0827505Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0828109Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0828711Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0829315Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0829920Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0830538Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0830668Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.0830710Z Autotune Choices Stats: 2025-12-04T09:58:55.0831480Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.0831710Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0831888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0832169Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0832804Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0833429Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0834056Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0834710Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0835346Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0836013Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0836653Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0837279Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0837906Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0838531Z 
triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0838682Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.0838758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0838813Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0838853Z unimplemented [] 2025-12-04T09:58:55.0838914Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0839013Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0839590Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0839642Z graph_break [] 2025-12-04T09:58:55.0839717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0839767Z Autotune Choices Stats: 2025-12-04T09:58:55.0840499Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.0840627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0840741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0840903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0841509Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0842115Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0842713Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0843334Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0843945Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0844556Z triton_flex_attention_298 0.0134 ms 87.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0845160Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0845764Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0846413Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0847018Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0847170Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.0847213Z Autotune Choices Stats: 2025-12-04T09:58:55.0847981Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.0848214Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0848393Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0848670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0849304Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0849936Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0850559Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0851180Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0851842Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0852484Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0853116Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0853745Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0854378Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0855000Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0855128Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.0855203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0855255Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0855295Z unimplemented [] 2025-12-04T09:58:55.0855355Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0855455Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0856078Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0856117Z graph_break [] 2025-12-04T09:58:55.0856192Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0856233Z Autotune Choices Stats: 2025-12-04T09:58:55.0856987Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.0857127Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0857241Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0857405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0858014Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0858622Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0859224Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0859828Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0860460Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0861075Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0861686Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0862286Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0862885Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0863489Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0863618Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.0863660Z Autotune Choices Stats: 2025-12-04T09:58:55.0864431Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.0864646Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0864810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0865100Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0865747Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0866407Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0867026Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0867652Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0868282Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0868936Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0869563Z 
triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0870205Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0870830Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0871451Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0871581Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.0871656Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0871699Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0871737Z unimplemented [] 2025-12-04T09:58:55.0871797Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0871897Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0872478Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0872528Z graph_break [] 2025-12-04T09:58:55.0872603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0872644Z Autotune Choices Stats: 2025-12-04T09:58:55.0873393Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.0873531Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0873643Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0873816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0874428Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0875033Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0875634Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0876295Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0876899Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0877551Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0878173Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0878788Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0879386Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0879984Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0880114Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.0880154Z Autotune Choices Stats: 2025-12-04T09:58:55.0880927Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.0881157Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0881334Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0881611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0882256Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0882888Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0883515Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0884136Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0884760Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0885389Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0886107Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0886746Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0887386Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0888013Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0888141Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:55.0888215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0888260Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0888296Z unimplemented [] 2025-12-04T09:58:55.0888357Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0888455Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0889029Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0889066Z graph_break [] 2025-12-04T09:58:55.0889140Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0889194Z Autotune Choices Stats: 2025-12-04T09:58:55.0889950Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.0890079Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0890192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0890356Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0890977Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0891589Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0892192Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0892793Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0893395Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0894017Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0894624Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0895234Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0895843Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0896489Z triton_flex_attention_432 0.0162 ms 57.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0896620Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.0896659Z Autotune Choices Stats: 2025-12-04T09:58:55.0897413Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.0897629Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0897793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0898097Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0898752Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0899380Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0900043Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0900665Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0901292Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0901917Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0902537Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0903191Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0903824Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0904457Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0904585Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.0904658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0904702Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0904740Z unimplemented [] 2025-12-04T09:58:55.0904800Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0904897Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0905475Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0905515Z graph_break [] 2025-12-04T09:58:55.0905588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0905631Z Autotune Choices Stats: 2025-12-04T09:58:55.0906410Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.0906560Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0906674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0906856Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0907467Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0908091Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0908704Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0909303Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0909907Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0910507Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0911131Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0911731Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0912349Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0912961Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0913093Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.0913133Z Autotune Choices Stats: 2025-12-04T09:58:55.0913895Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.0914112Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0914279Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0914557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0915185Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0915833Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0916526Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0917158Z triton_flex_attention_backward_493 0.0190 ms 81.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0917783Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0918411Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0919032Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0919698Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0920322Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0920958Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0921098Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.0921173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0921218Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0921255Z unimplemented [] 2025-12-04T09:58:55.0921315Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0921414Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0921992Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0922029Z graph_break [] 2025-12-04T09:58:55.0922102Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0922142Z Autotune Choices Stats: 2025-12-04T09:58:55.0922885Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.0923014Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0923128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0923299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0923919Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0924525Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0925136Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0925747Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0926393Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0927001Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0927601Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0928234Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0928832Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0929443Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0929595Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.0929635Z Autotune Choices Stats: 2025-12-04T09:58:55.0930393Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.0930611Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0930776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0931059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.0931689Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0932326Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0932944Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0933579Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0934207Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0934834Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0935463Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0936132Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0936784Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0937408Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0937554Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.0937630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0937686Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0937726Z unimplemented [] 2025-12-04T09:58:55.0937786Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0937888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0938464Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.0938505Z graph_break [] 2025-12-04T09:58:55.0938580Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0938622Z Autotune Choices Stats: 2025-12-04T09:58:55.0939351Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 
2025-12-04T09:58:55.0939479Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0939593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0939754Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0940363Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0940987Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0941591Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0942199Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0942809Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0943410Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0944011Z triton_flex_attention_566 0.0140 ms 72.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0944613Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0945230Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0945832Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0946026Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:55.0946078Z Autotune Choices Stats: 2025-12-04T09:58:55.0946836Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.0947053Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0947223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0947501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0948131Z triton_flex_attention_backward_593 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0948762Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0949407Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0950031Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0950671Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0951310Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0951930Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0952556Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0953189Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0953837Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0953965Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 
choices 2025-12-04T09:58:55.0954040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0954084Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0954124Z unimplemented [] 2025-12-04T09:58:55.0954188Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0954288Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0954868Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0954914Z graph_break [] 2025-12-04T09:58:55.0954989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0955029Z Autotune Choices Stats: 2025-12-04T09:58:55.0955769Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.0955896Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.0956045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0956207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0956814Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0957419Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0958045Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0958657Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0959267Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0959875Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0960477Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0961077Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0961679Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0962311Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0962442Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.0962484Z Autotune Choices Stats: 2025-12-04T09:58:55.0963256Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.0963482Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0963649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0963926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0964563Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0965189Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0965814Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0966508Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0967138Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0967777Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0968414Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.0969043Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0969670Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0970294Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0970435Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.0970510Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.0970552Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.0970603Z unimplemented [] 2025-12-04T09:58:55.0970664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0970766Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0971348Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0971387Z graph_break [] 2025-12-04T09:58:55.0971476Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0971528Z Autotune Choices Stats: 2025-12-04T09:58:55.0972261Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.0972388Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0972504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0972669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0973280Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0973877Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0974482Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.0975105Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0975725Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0976377Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0976983Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0977589Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0978186Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0978789Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.0978949Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.0978992Z Autotune Choices Stats: 2025-12-04T09:58:55.0979773Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.0980007Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0980175Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0980465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0981090Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0981719Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0982348Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0982966Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0983617Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0984247Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0984879Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0985504Z triton_flex_attention_backward_689 0.0220 ms 70.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0986164Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0986797Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0986927Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.0987001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.0987069Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.0987105Z unimplemented [] 2025-12-04T09:58:55.0987166Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.0987265Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.0987851Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.0987888Z graph_break [] 2025-12-04T09:58:55.0987963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.0988003Z Autotune Choices Stats: 2025-12-04T09:58:55.0988757Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.0988899Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0989012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0989176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0989794Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0990398Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0991007Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0991608Z triton_flex_attention_695 0.0117 ms 86.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0992232Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.0992848Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0993469Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0994076Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0994677Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0995280Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0995410Z SingleProcess AUTOTUNE 
benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.0995449Z Autotune Choices Stats: 2025-12-04T09:58:55.0996263Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.0996493Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.0996659Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.0996952Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.0997596Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0998223Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0998848Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.0999474Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1000105Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1000751Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1001383Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1002024Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1002652Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1003278Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1003408Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.1003483Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1003529Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1003566Z unimplemented [] 
2025-12-04T09:58:55.1003627Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1003725Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1004306Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1004354Z graph_break [] 2025-12-04T09:58:55.1004429Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1004470Z Autotune Choices Stats: 2025-12-04T09:58:55.1005222Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.1005349Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1005474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1005646Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1006295Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1006894Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1007495Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1008090Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1008694Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1009335Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1009956Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1010568Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1011172Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1011774Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1011903Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling 
for 24 choices 2025-12-04T09:58:55.1011943Z Autotune Choices Stats: 2025-12-04T09:58:55.1012700Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.1012930Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1013107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1013386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1014025Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1014661Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1015286Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1015912Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1016578Z 
triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1017206Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1017855Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1018489Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1019126Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1019750Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1019881Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.1019956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1020002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1020039Z unimplemented [] 2025-12-04T09:58:55.1020101Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.1020202Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1020777Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1020817Z graph_break [] 2025-12-04T09:58:55.1020892Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1020935Z Autotune Choices Stats: 2025-12-04T09:58:55.1021696Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.1021824Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1021937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1022101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1022726Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1023340Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1023941Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1024550Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1025150Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1025764Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1026431Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1027049Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1027665Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1028265Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1028398Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.1028438Z Autotune Choices Stats: 
2025-12-04T09:58:55.1029198Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.1029415Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1029579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1029867Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1030509Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1031149Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1031785Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1032410Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1033037Z triton_flex_attention_backward_825 0.0202 ms 77.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1033664Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1034287Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1034942Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1035581Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1036254Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1036385Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.1036461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1036504Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1036544Z unimplemented [] 2025-12-04T09:58:55.1036605Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1036706Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1037278Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1037317Z graph_break [] 2025-12-04T09:58:55.1037392Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1037434Z Autotune Choices Stats: 2025-12-04T09:58:55.1038174Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.1038326Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1038441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1038614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.1039224Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1039843Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1040460Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1041062Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1041667Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1042271Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1042902Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.1043503Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1044115Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1044730Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1044861Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.1044903Z Autotune Choices Stats: 2025-12-04T09:58:55.1045656Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.1045873Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1046078Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1046355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1046987Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1047646Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1048287Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1048927Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1049554Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1050184Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1050930Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1051587Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1052211Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1052846Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1052983Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.1053060Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1053103Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1053143Z unimplemented [] 2025-12-04T09:58:55.1053204Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1053303Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1053879Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1053919Z graph_break [] 2025-12-04T09:58:55.1053992Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1054034Z Autotune Choices Stats: 2025-12-04T09:58:55.1054778Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.1054905Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1055020Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1055192Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1055823Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1056455Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1057078Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1057704Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1058307Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1058913Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1059514Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1060147Z triton_flex_attention_894 0.0141 ms 68.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1060750Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1061365Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1061504Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.1061544Z Autotune Choices Stats: 2025-12-04T09:58:55.1062299Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.1062520Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1062685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1062964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1063594Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1064241Z 
triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1064867Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1065501Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1066172Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1066800Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1067427Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1068060Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1069424Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1070728Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1071525Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.1071780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1071951Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1072068Z unimplemented [] 2025-12-04T09:58:55.1072192Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1072388Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1073101Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1073740Z graph_break [] 2025-12-04T09:58:55.1073869Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1074024Z Autotune Choices Stats: 2025-12-04T09:58:55.1074844Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.1075743Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1076064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1076378Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1077182Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1078476Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1079730Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1081004Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1082257Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1083505Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1084752Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1086036Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1087316Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1088555Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1089351Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.1089561Z Autotune Choices Stats: 2025-12-04T09:58:55.1090396Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.1091405Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1091826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1092310Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1093274Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1094558Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1095856Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1097180Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1098585Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1099884Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1101174Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1102468Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1103751Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1105071Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1105872Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.1106162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1106320Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1106430Z unimplemented [] 2025-12-04T09:58:55.1106552Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1106748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1107475Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1108127Z graph_break [] 2025-12-04T09:58:55.1108256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1108409Z Autotune Choices Stats: 2025-12-04T09:58:55.1109209Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.1110100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1110375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1110687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1111494Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1112732Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1113999Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1115246Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1116533Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1117782Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1119030Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1120275Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1121516Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1122771Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1123543Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.1123745Z Autotune Choices Stats: 2025-12-04T09:58:55.1124574Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.1125592Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1126049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1126528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1127471Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1128761Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1130051Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1131359Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1132641Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1133942Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1135251Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1136570Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1137863Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1139153Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1139956Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.1140197Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1140355Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1140478Z unimplemented [] 2025-12-04T09:58:55.1140593Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1140790Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1141498Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1142133Z graph_break [] 2025-12-04T09:58:55.1142274Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1142440Z Autotune Choices Stats: 2025-12-04T09:58:55.1143242Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.1144145Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1144424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1144737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1145550Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1146829Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1148072Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1149349Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.1150600Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1151853Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1153099Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1154345Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1155587Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1156858Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1157646Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.1157850Z Autotune Choices Stats: 2025-12-04T09:58:55.1158688Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.1159692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1160122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1160611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1161555Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1162857Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1164145Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1165431Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1166781Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1168086Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1169374Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1170673Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1171964Z 
triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1173256Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1174047Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.1174286Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1174455Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1174565Z unimplemented [] 2025-12-04T09:58:55.1174685Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1174882Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1175601Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1176278Z graph_break [] 2025-12-04T09:58:55.1176409Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1176562Z Autotune Choices Stats: 2025-12-04T09:58:55.1177383Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.1178295Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1178570Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1178883Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1179697Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1180945Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1182194Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1183440Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1184697Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1186002Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1187258Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1188501Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1189752Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1190998Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1191768Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.1191973Z Autotune Choices Stats: 2025-12-04T09:58:55.1192827Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.1193851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1194267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1194757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1195714Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1197030Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1198311Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1199605Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1200897Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1202239Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1203556Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1204857Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1206189Z 
triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1207475Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1208265Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.1208503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1208662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1208779Z unimplemented [] 2025-12-04T09:58:55.1208902Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1209098Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1209815Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1210465Z graph_break [] 2025-12-04T09:58:55.1210597Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1210750Z Autotune Choices Stats: 2025-12-04T09:58:55.1211566Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.1212545Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1212890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1213213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1214022Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1215266Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1216544Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1217786Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1219033Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1220317Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1221566Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1222823Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1224060Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1225308Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1226113Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.1226319Z Autotune Choices Stats: 2025-12-04T09:58:55.1227139Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.1228165Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1228600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1229082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1230042Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1231346Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1232633Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1233922Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1235216Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1236548Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1237861Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1239166Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1240469Z 
triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1241759Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1242550Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.1242793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1242951Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1243064Z unimplemented [] 2025-12-04T09:58:55.1243181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1243375Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1244085Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1244728Z graph_break [] 2025-12-04T09:58:55.1244858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1245020Z Autotune Choices Stats: 2025-12-04T09:58:55.1245840Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.1246770Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1247049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1247360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1248175Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1249434Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1250683Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1251928Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1253172Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1254430Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1255690Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1256981Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1258236Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1259482Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1260250Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.1260457Z Autotune Choices Stats: 2025-12-04T09:58:55.1261277Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.1262288Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1262704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1263195Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1264165Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1265464Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1266800Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1268087Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1269376Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1270668Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1271964Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1273271Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1274576Z 
triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1275884Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1276721Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.1276965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1277121Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1277230Z unimplemented [] 2025-12-04T09:58:55.1277352Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1277550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1278266Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1278911Z graph_break [] 2025-12-04T09:58:55.1279041Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1279191Z Autotune Choices Stats: 2025-12-04T09:58:55.1279987Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.1280915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1281188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1281509Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1282310Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1283579Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1284841Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1286120Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1287365Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1288595Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1289862Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1291108Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1292360Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1293615Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1294387Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.1294590Z Autotune Choices Stats: 2025-12-04T09:58:55.1295409Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.1296453Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1296867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1297343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1298306Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1299603Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1300905Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1302187Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1303477Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1304767Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1306098Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1307409Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1308694Z 
triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1309998Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1311034Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.1311313Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1314682Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1314805Z unimplemented [] 2025-12-04T09:58:55.1314924Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1315124Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1315836Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1316526Z graph_break [] 2025-12-04T09:58:55.1316658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1316814Z Autotune Choices Stats: 2025-12-04T09:58:55.1317618Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.1318514Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1318791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1319133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1319958Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1321198Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1322452Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1323707Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1324948Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1326235Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1327482Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1328759Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1330001Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1331252Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1332034Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.1332240Z Autotune Choices Stats: 2025-12-04T09:58:55.1333063Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.1334061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1334475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1334957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1335897Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1337251Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1338534Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1339826Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1341129Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1342418Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1343704Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1345013Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1346385Z 
triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1347664Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1348467Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.1348716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1348872Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1348979Z unimplemented [] 2025-12-04T09:58:55.1349097Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1349291Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1349993Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1350625Z graph_break [] 2025-12-04T09:58:55.1350752Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1350905Z Autotune Choices Stats: 2025-12-04T09:58:55.1351709Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.1352602Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1352882Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1353198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1354005Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1355276Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1356569Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1357822Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1359061Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1360316Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1361562Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1362805Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1364071Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1365313Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1366129Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.1366343Z Autotune Choices Stats: 2025-12-04T09:58:55.1367165Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.1368166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1368582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1369056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1369998Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1371285Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1372589Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1373880Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1375182Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1376507Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1377799Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1379095Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1380375Z 
triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1381695Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1382487Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.1382723Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1382876Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1382984Z unimplemented [] 2025-12-04T09:58:55.1383098Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1383305Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1384013Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1384666Z graph_break [] 2025-12-04T09:58:55.1384793Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1384942Z Autotune Choices Stats: 2025-12-04T09:58:55.1385850Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.1386788Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1387061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1387369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1388178Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1389435Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1390723Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1391976Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1393234Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1394482Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1395723Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1396996Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1398239Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1399507Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1400275Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.1400477Z Autotune Choices Stats: 2025-12-04T09:58:55.1401309Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.1402330Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1402743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1403216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1404167Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1405455Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1406778Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1408084Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1409377Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1410691Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1411983Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1413267Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1414566Z 
triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1415859Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1416694Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.1416949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1417104Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1417211Z unimplemented [] 2025-12-04T09:58:55.1417325Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1417518Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1418236Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1418877Z graph_break [] 2025-12-04T09:58:55.1419021Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1419171Z Autotune Choices Stats: 2025-12-04T09:58:55.1419971Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.1420864Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1421138Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1421445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1422247Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1423491Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1424735Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1426042Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1427293Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1428545Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1429792Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1431034Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1432276Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1433516Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1434295Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.1434502Z Autotune Choices Stats: 2025-12-04T09:58:55.1435326Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.1437973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1438408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1438882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1439830Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1441127Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1443053Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1444337Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1445654Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1447001Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1448328Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1449626Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1450915Z 
triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1452214Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1453011Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.1453264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1453431Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1453553Z unimplemented [] 2025-12-04T09:58:55.1453688Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1453888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1454615Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1455250Z graph_break [] 2025-12-04T09:58:55.1455380Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1455533Z Autotune Choices Stats: 2025-12-04T09:58:55.1456384Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.1457301Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1457580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1457891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1458708Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1459951Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1461208Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1462452Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1463710Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1464963Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1466257Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1467513Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1468756Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1470014Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1470797Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.1471012Z Autotune Choices Stats: 2025-12-04T09:58:55.1471853Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.1472869Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1473291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1473793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1474732Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1476075Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1477362Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1478672Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1479978Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1481278Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1482585Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1483889Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1485186Z 
triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1486540Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1487361Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.1487685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1487839Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1487949Z unimplemented [] 2025-12-04T09:58:55.1488068Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1488269Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1488986Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1489631Z graph_break [] 2025-12-04T09:58:55.1489778Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1489942Z Autotune Choices Stats: 2025-12-04T09:58:55.1490751Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.1491688Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1491975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1492293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1493108Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1494355Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1495666Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1496965Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1498226Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1498827Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1499450Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1500067Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1500683Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1501283Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1501436Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.1501487Z Autotune Choices Stats: 2025-12-04T09:58:55.1502254Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.1502481Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1502661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1502943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1503588Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1504238Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1504868Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1505499Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1506191Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1506821Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1507467Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1508114Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1508761Z 
triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1509392Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1509521Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.1509599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1509641Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1509681Z unimplemented [] 2025-12-04T09:58:55.1509760Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1509865Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1510433Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1510472Z graph_break [] 2025-12-04T09:58:55.1510549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1510590Z Autotune Choices Stats: 2025-12-04T09:58:55.1511337Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 2025-12-04T09:58:55.1511470Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1511586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1511743Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1512373Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1512984Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1513586Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1514187Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1514805Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1515422Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1516090Z triton_flex_attention_1540 0.0131 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1516712Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1517326Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1517933Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1518061Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.1518105Z Autotune Choices Stats: 2025-12-04T09:58:55.1518859Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.1519090Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1519254Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1519550Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1520184Z triton_flex_attention_backward_1559 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1520826Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1521460Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1522093Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1522722Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1523361Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1524002Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1524628Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1525259Z 
triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1525894Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1526058Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 seconds precompiling for 22 choices 2025-12-04T09:58:55.1526136Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1526179Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1526218Z unimplemented [] 2025-12-04T09:58:55.1526279Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1526384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1526955Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1527016Z graph_break [] 2025-12-04T09:58:55.1527092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1527132Z Autotune Choices Stats: 2025-12-04T09:58:55.1527873Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.1528001Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1528128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1528289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1528918Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1529534Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1530136Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1530740Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1531347Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1531963Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1532578Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1533183Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1533808Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1534411Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1534543Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.1534585Z Autotune Choices Stats: 2025-12-04T09:58:55.1535343Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.1535572Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1535738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1536045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1536696Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1537324Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1537962Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1538598Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1539227Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1539861Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1540501Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1541142Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1541774Z 
triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1542436Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1542566Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.1542642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1542683Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1542724Z unimplemented [] 2025-12-04T09:58:55.1542786Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1542888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1543462Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1543500Z graph_break [] 2025-12-04T09:58:55.1543576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1543616Z Autotune Choices Stats: 2025-12-04T09:58:55.1544363Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.1544502Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1544616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1544777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1545400Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1546057Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1546671Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1547275Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1547882Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1548491Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1549113Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1549729Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1550337Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1550948Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1551076Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.1551118Z Autotune Choices Stats: 2025-12-04T09:58:55.1551877Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.1552094Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1552259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1552555Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1553191Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1553828Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1554454Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1555105Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1555736Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1556422Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1557047Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1557700Z triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1558343Z 
triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1558980Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1559125Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.1559200Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1559243Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1559280Z unimplemented [] 2025-12-04T09:58:55.1559342Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1559442Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1560020Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1560057Z graph_break [] 2025-12-04T09:58:55.1560133Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1560172Z Autotune Choices Stats: 2025-12-04T09:58:55.1560907Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.1561046Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1561159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1561318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1561925Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1562542Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1563159Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1563779Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1564385Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1564989Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1565605Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1566252Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1566877Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1567495Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1567648Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.1567688Z Autotune Choices Stats: 2025-12-04T09:58:55.1568452Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.1568674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1568837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1569116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1569743Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1570383Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1571028Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1571661Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1572301Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1572930Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1573556Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1574197Z triton_flex_attention_backward_1701 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1574819Z 
triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1575462Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1575590Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.1575663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1575706Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1575745Z unimplemented [] 2025-12-04T09:58:55.1575829Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1575967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1576542Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1576578Z graph_break [] 2025-12-04T09:58:55.1576654Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1576693Z Autotune Choices Stats: 2025-12-04T09:58:55.1577440Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.1577569Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1577682Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1577859Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1578477Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1579081Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1579700Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1580316Z triton_flex_attention_1705 0.0130 ms 80.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1580937Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1581541Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1582148Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1582763Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1583363Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1583974Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1584103Z SingleProcess AUTOTUNE benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.1584142Z Autotune Choices Stats: 2025-12-04T09:58:55.1584913Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.1585141Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1585304Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1585584Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1586249Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1586889Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1587517Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1588157Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1588802Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1589434Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1590058Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1590692Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1591334Z 
triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1591956Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1592098Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.1592176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1592218Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1592257Z unimplemented [] 2025-12-04T09:58:55.1592318Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1592417Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1592995Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1593047Z graph_break [] 2025-12-04T09:58:55.1593121Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1593163Z Autotune Choices Stats: 2025-12-04T09:58:55.1593906Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.1594035Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1594151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1594313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1594930Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1595545Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1596190Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1596795Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1597417Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1598033Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1598635Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1599239Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1599860Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1600461Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1600607Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds precompiling for 24 choices 2025-12-04T09:58:55.1600648Z Autotune Choices Stats: 2025-12-04T09:58:55.1601411Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.1601639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1601806Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1602081Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1602707Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1603335Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1603969Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1604594Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1605234Z triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1605871Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1606546Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1607176Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1607805Z 
triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1608458Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1608587Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.1608663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1608706Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1608747Z unimplemented [] 2025-12-04T09:58:55.1608807Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1608907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1609497Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1609538Z graph_break [] 2025-12-04T09:58:55.1609611Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1609652Z Autotune Choices Stats: 2025-12-04T09:58:55.1610412Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.1610557Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1610676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1610842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1611458Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1612071Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1612691Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1613310Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1613914Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1614528Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1615139Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1615746Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1616381Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1617006Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1617133Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.1617177Z Autotune Choices Stats: 2025-12-04T09:58:55.1617955Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.1618173Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1618353Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1618642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1619275Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1619903Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1620530Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1621163Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1621802Z triton_flex_attention_backward_1837 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1622431Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1623061Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1623703Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1624336Z 
triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1624965Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1625105Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.1625198Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.1625247Z Traceback (most recent call last): 2025-12-04T09:58:55.1625400Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.1625440Z self.assertTrue( 2025-12-04T09:58:55.1625546Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.1625595Z raise self.failureException(msg) 2025-12-04T09:58:55.1625726Z AssertionError: False is not true : Log file /tmp/tmp1x81keg9/flex_attention_configs.json was not created 2025-12-04T09:58:55.1625731Z 
2025-12-04T09:58:55.1625808Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.1626047Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.1626049Z 2025-12-04T09:58:55.1626140Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.1626216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1626258Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1626294Z unimplemented [] 2025-12-04T09:58:55.1626357Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1626951Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.1627069Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1627104Z graph_break [] 2025-12-04T09:58:55.1627179Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1627662Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.1627712Z current_size = base.storage().size() 2025-12-04T09:58:55.1627753Z Autotune Choices Stats: 2025-12-04T09:58:55.1628502Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.1628631Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1628745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1628922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1629534Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1630150Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1630753Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1631367Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1631985Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1632586Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1633185Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1633796Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1634392Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1635006Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1635137Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.1635177Z Autotune Choices Stats: 2025-12-04T09:58:55.1636000Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.1636230Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1636393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1636672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1637298Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1637935Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1638557Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1639189Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1639826Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1640463Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1641083Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1641709Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1642342Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1642967Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1643109Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.1643184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1643228Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1643265Z unimplemented [] 2025-12-04T09:58:55.1643327Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1643425Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1644013Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1644062Z graph_break [] 2025-12-04T09:58:55.1644135Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.1644176Z Autotune Choices Stats: 2025-12-04T09:58:55.1644918Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.1645047Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1645159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1645321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1645959Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1646572Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1647191Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1647795Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1648416Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1649024Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1649629Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1650231Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1650842Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1651445Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1651589Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.1651629Z Autotune Choices Stats: 2025-12-04T09:58:55.1652399Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.1652629Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1652791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1653068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1653705Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1654324Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1654955Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1655577Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1656256Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1656893Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1657526Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1658153Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1658776Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1659407Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1659536Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.1659611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1659654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1659693Z unimplemented [] 2025-12-04T09:58:55.1659755Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1659853Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1660441Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1660482Z graph_break [] 2025-12-04T09:58:55.1660555Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1660600Z Autotune Choices Stats: 2025-12-04T09:58:55.1661355Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.1661493Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1661608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1661769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1662386Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1662986Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1663593Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1664211Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1664817Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.1665430Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1666087Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1666691Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1667297Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1667909Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1668040Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.1668080Z Autotune Choices Stats: 2025-12-04T09:58:55.1668851Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.1669072Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.1669249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1669539Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1670169Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1670793Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1671417Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1672052Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1672679Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1673317Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1673955Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1674590Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1675213Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1675837Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1676010Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.1676089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1676131Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1676169Z unimplemented [] 2025-12-04T09:58:55.1676229Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1676330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1676904Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1676964Z graph_break [] 2025-12-04T09:58:55.1677039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1677080Z Autotune Choices Stats: 2025-12-04T09:58:55.1677827Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.1677970Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1678088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1678248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1678861Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1679465Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1680066Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1680693Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1681309Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1681914Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1682528Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1683140Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1683741Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1684342Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1684482Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.1684524Z Autotune Choices Stats: 2025-12-04T09:58:55.1685287Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.1685516Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1685680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1685989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1686645Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1687287Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1687909Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1688531Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1689178Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1689826Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1690446Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1691084Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1691725Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1692349Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1692476Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.1692552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1692605Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1692644Z unimplemented [] 2025-12-04T09:58:55.1692709Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1692808Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1693377Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1693415Z graph_break [] 2025-12-04T09:58:55.1693490Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1693531Z Autotune Choices Stats: 2025-12-04T09:58:55.1694274Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.1694399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1694515Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1694687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1695319Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1695962Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.1696569Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1697163Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1697787Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1698412Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1699019Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1699653Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1700263Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1700873Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1701004Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.1701052Z Autotune Choices Stats: 2025-12-04T09:58:55.1701809Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.1702040Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1702205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1702498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1703132Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1703769Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1704403Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1705031Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1705657Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1706337Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1706991Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1707611Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1708267Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1708885Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1709015Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.1709093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1709135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1709175Z unimplemented [] 2025-12-04T09:58:55.1709237Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1709337Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1709915Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1709968Z graph_break [] 2025-12-04T09:58:55.1710042Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1710081Z Autotune Choices Stats: 2025-12-04T09:58:55.1710822Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.1710949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1711074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1711235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1711862Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1712474Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1713075Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1713682Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1714288Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1714895Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1715510Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1716255Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1716873Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1717472Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1717602Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.1717718Z Autotune Choices Stats: 2025-12-04T09:58:55.1718476Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.1718711Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1718877Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1719153Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1719799Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1720427Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1721068Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1721692Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1722322Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1722961Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1723596Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1724235Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1724877Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1725513Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1725641Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.1725716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1725763Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1725799Z unimplemented [] 2025-12-04T09:58:55.1725862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1725997Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1726572Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1726608Z graph_break [] 2025-12-04T09:58:55.1726682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1726723Z Autotune Choices Stats: 2025-12-04T09:58:55.1727473Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.1727601Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1727714Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1727879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1728505Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1729122Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1729738Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1730341Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1730946Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1731552Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1732164Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1732776Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1733384Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1734001Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1734130Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.1734170Z Autotune Choices Stats: 2025-12-04T09:58:55.1734930Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.1735148Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1735316Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1735604Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1736269Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1736904Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1737538Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1738176Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1738802Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1739424Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1740045Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1740686Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1741321Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1741955Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1742096Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.1742169Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1742214Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1742250Z unimplemented [] 2025-12-04T09:58:55.1742311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1742410Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1742988Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1743027Z graph_break [] 2025-12-04T09:58:55.1743103Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1743142Z Autotune Choices Stats: 2025-12-04T09:58:55.1743883Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.1744029Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1744144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1744305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1744920Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1745535Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1746205Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1746823Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1747429Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1748034Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1748654Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1749256Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1749873Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1750479Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1750619Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.1750660Z Autotune Choices Stats: 2025-12-04T09:58:55.1751418Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.1751639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1751804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1752083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.1752714Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1753342Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1753980Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1754614Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1755252Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1755883Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1756547Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1757192Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1757817Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1758469Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1758598Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices
2025-12-04T09:58:55.1758673Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:55.1758717Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:55.1758756Z unimplemented []
2025-12-04T09:58:55.1758859Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:55.1758960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:55.1759528Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)]
2025-12-04T09:58:55.1759567Z graph_break []
2025-12-04T09:58:55.1759642Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:55.1759682Z Autotune Choices Stats:
2025-12-04T09:58:55.1760427Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0}
2025-12-04T09:58:55.1760564Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1760676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1760851Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1761466Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1762067Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1762685Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1763296Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1763911Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1764510Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1765115Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1765722Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1766364Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1766980Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1767109Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.1767147Z Autotune Choices Stats: 2025-12-04T09:58:55.1767922Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.1768152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1768317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1768597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1769230Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1769868Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1770489Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1771129Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1771768Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1772400Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1773028Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1773651Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1774290Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1774983Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1775124Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.1775202Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1775245Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1775285Z unimplemented [] 2025-12-04T09:58:55.1775347Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1775447Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1776080Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1776134Z graph_break [] 2025-12-04T09:58:55.1776208Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1776249Z Autotune Choices Stats: 2025-12-04T09:58:55.1776988Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.1777119Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.1777233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1777392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1778003Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1778621Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1779221Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1779833Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1780446Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1781061Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1781666Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1782266Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1782879Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1783479Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1783620Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.1783661Z Autotune Choices Stats: 2025-12-04T09:58:55.1784416Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.1784659Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1784827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1785107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1785739Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1786396Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1787041Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1787667Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1788304Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1788936Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1789571Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.1790199Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1790823Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1791464Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1791594Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.1791669Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.1791712Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1791752Z unimplemented [] 2025-12-04T09:58:55.1791814Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1791914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1792500Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1792537Z graph_break [] 2025-12-04T09:58:55.1792610Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1792650Z Autotune Choices Stats: 2025-12-04T09:58:55.1793399Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.1793537Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1793650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1793808Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1794426Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1795028Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1795643Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1796303Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1796901Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1797512Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1798127Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1798734Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1799337Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1799948Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1800075Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.1800116Z Autotune Choices Stats: 2025-12-04T09:58:55.1800892Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.1801113Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1801278Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1801576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1802200Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1802826Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1803448Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1804085Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1804713Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1805355Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1806032Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1806666Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1807295Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1807918Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1808068Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.1808142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1808184Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.1808223Z unimplemented [] 2025-12-04T09:58:55.1808283Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1808383Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1808959Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1808999Z graph_break [] 2025-12-04T09:58:55.1809090Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1809132Z Autotune Choices Stats: 2025-12-04T09:58:55.1809886Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.1810025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1810142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1810301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1810916Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1811522Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1812124Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1812734Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1813344Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1813950Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1814562Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1815176Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1815781Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1816424Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1816570Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.1816611Z Autotune Choices Stats: 2025-12-04T09:58:55.1817367Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.1817585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1817764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1818037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1818690Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1819330Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1819951Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1820579Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1821221Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1821860Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1822483Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1823125Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1823770Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1824397Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1824527Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.1824604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1824647Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1824696Z unimplemented [] 
2025-12-04T09:58:55.1824757Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1824859Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1825440Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1825478Z graph_break [] 2025-12-04T09:58:55.1825554Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1825594Z Autotune Choices Stats: 2025-12-04T09:58:55.1826396Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.1826523Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1826637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1826798Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1827440Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1828042Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1828646Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1829242Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1829860Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1830477Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1833367Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1834016Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1834621Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1835224Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1835355Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.1835397Z Autotune Choices Stats: 2025-12-04T09:58:55.1836189Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.1836428Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1836602Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1836896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1837526Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1838171Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1838812Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1839436Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1840064Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1840705Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1841340Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1841968Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1842612Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1843252Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1843384Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.1843461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1843505Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1843542Z unimplemented [] 2025-12-04T09:58:55.1843605Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.1843707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1844290Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1844338Z graph_break [] 2025-12-04T09:58:55.1844416Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1844456Z Autotune Choices Stats: 2025-12-04T09:58:55.1845203Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.1845333Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1845462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1845626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1846317Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1846935Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1847540Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1848146Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1848749Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1849359Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1849980Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1850576Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1851192Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1851794Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1851926Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.1851966Z Autotune Choices Stats: 
2025-12-04T09:58:55.1852722Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.1852950Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1853116Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1853396Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1854040Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1854668Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1855303Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1855938Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1856596Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1857222Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1857865Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1858510Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1859138Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1859787Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1859916Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.1859990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1860034Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1860073Z unimplemented [] 2025-12-04T09:58:55.1860135Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1860236Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1860816Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1860853Z graph_break [] 2025-12-04T09:58:55.1860927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1860967Z Autotune Choices Stats: 2025-12-04T09:58:55.1861700Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.1861838Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1861951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1862112Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.1862740Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1863481Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1864093Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1864700Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1865305Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1865913Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1866566Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1867185Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1867801Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1868421Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1868550Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.1868589Z Autotune Choices Stats: 2025-12-04T09:58:55.1869350Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.1869568Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1869732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1870028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1870657Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.1871292Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1871915Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1872564Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1873191Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1873820Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1874447Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1875093Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1875729Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1876409Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1876551Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.1876626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1876671Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1876709Z unimplemented [] 2025-12-04T09:58:55.1876772Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1876872Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1877446Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1877486Z graph_break [] 2025-12-04T09:58:55.1877559Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1877600Z Autotune Choices Stats: 2025-12-04T09:58:55.1878333Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.1878475Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1878588Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1878750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1879364Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1879976Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1880590Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1881201Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1881803Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1882408Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1883012Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1883628Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1884239Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1884850Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1884993Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.1885033Z Autotune Choices Stats: 2025-12-04T09:58:55.1885797Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.1886060Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1886226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1886505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1887131Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1887769Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1888406Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1889034Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1889680Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1890310Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1890934Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1891561Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1892204Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1892838Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1892967Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.1893041Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1893082Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1893119Z unimplemented [] 2025-12-04T09:58:55.1893181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1893301Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1893878Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1893916Z graph_break [] 2025-12-04T09:58:55.1893989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1894029Z Autotune Choices Stats: 2025-12-04T09:58:55.1894775Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.1894903Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1895018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1895177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1895796Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1896490Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1897109Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1897718Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1898334Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1898939Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1899542Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1900153Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1900759Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1901374Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1901502Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.1901542Z Autotune Choices Stats: 2025-12-04T09:58:55.1902305Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.1902532Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1902695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1902973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1903606Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1904294Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1904928Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1905567Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1906249Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1906887Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1907518Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1908148Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1908787Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1909410Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1909541Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.1909627Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1909670Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1909707Z unimplemented [] 2025-12-04T09:58:55.1909766Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1909866Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1910452Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1910503Z graph_break [] 2025-12-04T09:58:55.1910576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1910616Z Autotune Choices Stats: 2025-12-04T09:58:55.1911355Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.1911483Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1911597Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1911757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1912369Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1912979Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1913579Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1914195Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1914810Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1915421Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1916066Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1916670Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1917295Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1917898Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1918026Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.1918080Z Autotune Choices Stats: 2025-12-04T09:58:55.1918836Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.1919079Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1919245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1919524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1920155Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1920781Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1921420Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1922042Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1922689Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1923329Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1923961Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1924588Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.1925213Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1925859Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1926027Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.1926102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1926143Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1926181Z unimplemented [] 2025-12-04T09:58:55.1926240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1926341Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1926937Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1926974Z graph_break [] 2025-12-04T09:58:55.1927049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1927089Z Autotune Choices Stats: 2025-12-04T09:58:55.1927854Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.1927995Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1928109Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1928268Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1928882Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1929485Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1930116Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1930717Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.1931333Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1931948Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1932565Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1933171Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1933775Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1934392Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1934525Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.1934565Z Autotune Choices Stats: 2025-12-04T09:58:55.1935334Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.1935551Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1935715Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1936082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1936711Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1937337Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1937959Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1938614Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1939246Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1939885Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1940517Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1941156Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1941787Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1942413Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1942549Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.1942624Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1942666Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1942703Z unimplemented [] 2025-12-04T09:58:55.1942763Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1942863Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1943438Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1943475Z graph_break [] 2025-12-04T09:58:55.1943561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1943600Z Autotune Choices Stats: 2025-12-04T09:58:55.1944344Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.1944491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1944606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1944766Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1945370Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1946012Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1946616Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1947231Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1947844Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1948448Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1949057Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1949672Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1950277Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1950881Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1951020Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.1951061Z Autotune Choices Stats: 2025-12-04T09:58:55.1951810Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.1952029Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1952204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1952483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1953127Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1953769Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1954391Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1955017Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1955657Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1956337Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1956975Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1957613Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1958252Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1958876Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1959004Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.1959080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1959122Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1959159Z unimplemented [] 2025-12-04T09:58:55.1959234Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1959335Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1959910Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.1959946Z graph_break [] 2025-12-04T09:58:55.1960022Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1960061Z Autotune Choices Stats: 2025-12-04T09:58:55.1960812Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.1960941Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1961057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1961216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1961853Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1962456Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1963061Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1963664Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1964275Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1964889Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1965492Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1966151Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1966763Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1967372Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1967505Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.1967545Z Autotune Choices Stats: 2025-12-04T09:58:55.1968316Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.1968554Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1968717Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1968994Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1969639Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1970281Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1970915Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1971539Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1972165Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1972816Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1973456Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1974081Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1974718Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1975355Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1975485Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.1975559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1975602Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1975638Z unimplemented [] 2025-12-04T09:58:55.1975700Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1975799Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1976469Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1976526Z graph_break [] 2025-12-04T09:58:55.1976602Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1976641Z Autotune Choices Stats: 2025-12-04T09:58:55.1977381Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.1977510Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1977636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1977799Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1978407Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1979028Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1979636Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1980239Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1980842Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1981452Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1982069Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1982672Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1983301Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1983911Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1984041Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.1984082Z Autotune Choices Stats: 2025-12-04T09:58:55.1984836Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.1985068Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1985232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1985511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1986261Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1986888Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1987527Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1988162Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1988792Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1989420Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1990061Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1990703Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1991330Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1991975Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1992102Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.1992177Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.1992220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.1992256Z unimplemented [] 2025-12-04T09:58:55.1992318Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.1992418Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.1992999Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.1993036Z graph_break [] 
2025-12-04T09:58:55.1993109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.1993149Z Autotune Choices Stats: 2025-12-04T09:58:55.1993891Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.1994030Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.1994143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.1994305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.1994932Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1995545Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1996225Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1996828Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.1997429Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1998032Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1998645Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1999264Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.1999866Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2000495Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2000623Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.2000662Z Autotune Choices Stats: 2025-12-04T09:58:55.2001416Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.2001634Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2001800Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2002094Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2002730Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2003368Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.2004002Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2004649Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2005276Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2005900Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2006574Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2007222Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2007858Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2008503Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2008648Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.2008723Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2008765Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2008802Z unimplemented [] 2025-12-04T09:58:55.2008862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2008963Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2009529Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2009568Z graph_break [] 2025-12-04T09:58:55.2009641Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.2009684Z Autotune Choices Stats: 2025-12-04T09:58:55.2010424Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.2010564Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2010679Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2010837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2011447Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2012065Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2012679Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2013296Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2013908Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2014513Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2015122Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2015731Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2016396Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2017008Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2017155Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.2017195Z Autotune Choices Stats: 2025-12-04T09:58:55.2017957Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.2018175Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2018340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2018615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2019253Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2019893Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2020526Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2021162Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2021815Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2022444Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2023067Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2023695Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2024341Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2024978Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2025104Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.2025179Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2025220Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2025259Z unimplemented [] 2025-12-04T09:58:55.2025319Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2025440Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2026053Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2026092Z graph_break [] 2025-12-04T09:58:55.2026166Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2026206Z Autotune Choices Stats: 
2025-12-04T09:58:55.2026949Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.2027075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2027189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2027348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2027976Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2028587Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2029217Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2029838Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2030458Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2031067Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2031663Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2032274Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2032888Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2033501Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2033628Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.2033669Z Autotune Choices Stats: 2025-12-04T09:58:55.2034447Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.2034673Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2034838Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2035121Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2035753Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2036422Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2037059Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2037701Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2038343Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2038987Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2039614Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2040243Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2040881Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2041506Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2041633Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.2041718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2041760Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2041799Z unimplemented [] 2025-12-04T09:58:55.2041862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2041962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2042548Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2042598Z graph_break [] 2025-12-04T09:58:55.2042674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2042713Z Autotune Choices Stats: 2025-12-04T09:58:55.2043453Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.2043579Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2043695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2043854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2044470Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2045082Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2045685Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2046346Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2046968Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.2047577Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2048181Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2048786Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2049408Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2050008Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2050153Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.2050194Z Autotune Choices Stats: 2025-12-04T09:58:55.2050957Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.2051195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.2051361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2051637Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2052268Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2052887Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2053529Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2054157Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2054797Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2055441Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2056113Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2056741Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2057367Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2058020Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2058147Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.2058221Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2058263Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2058300Z unimplemented [] 2025-12-04T09:58:55.2058363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2058462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2059045Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2059083Z graph_break [] 2025-12-04T09:58:55.2059157Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2059196Z Autotune Choices Stats: 2025-12-04T09:58:55.2059955Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.2060097Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2060210Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2060371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2060983Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2061593Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2062212Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2062828Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2063430Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2064043Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2064661Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2065263Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2065870Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2066544Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2066675Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.2066714Z Autotune Choices Stats: 2025-12-04T09:58:55.2067494Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.2067711Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2067887Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2068174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2068805Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2069435Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2070059Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2070696Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2071325Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2071964Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2072600Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2073237Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2073865Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2074492Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2074631Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.2074704Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2074745Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2074782Z unimplemented [] 2025-12-04T09:58:55.2074843Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2074942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2075518Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2075555Z graph_break [] 2025-12-04T09:58:55.2075643Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2075684Z Autotune Choices Stats: 2025-12-04T09:58:55.2076472Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.2076612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2076725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2076886Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2077497Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2078104Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2078707Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2079324Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2079940Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2080546Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2081161Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2081783Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2082392Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2082998Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2083139Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.2083179Z Autotune Choices Stats: 2025-12-04T09:58:55.2083941Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2084174Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2084339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2084621Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2085262Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2085902Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2086567Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2087199Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2087843Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2088486Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2089119Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2089766Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2090406Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2091036Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2091169Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.2091244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2091302Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2091344Z unimplemented [] 2025-12-04T09:58:55.2091409Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2091510Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2092084Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2092122Z graph_break [] 2025-12-04T09:58:55.2092195Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2092235Z Autotune Choices Stats: 2025-12-04T09:58:55.2092994Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.2093122Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2093237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2093418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2094047Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2094653Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.2095257Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2095868Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2096507Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2097134Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2097740Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2098379Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2098983Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2099588Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2099718Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.2099757Z Autotune Choices Stats: 2025-12-04T09:58:55.2100533Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2100764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2100928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2101219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2101855Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2102499Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2103136Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2103767Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2104400Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2105040Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2105682Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2106328Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2106997Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2107623Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2107755Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.2107833Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2107875Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2107913Z unimplemented [] 2025-12-04T09:58:55.2107977Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2108078Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2108653Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2108707Z graph_break [] 2025-12-04T09:58:55.2108781Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2108823Z Autotune Choices Stats: 2025-12-04T09:58:55.2109567Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.2109712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2109828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2109987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2110607Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2111226Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2111832Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2112443Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2113057Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2113678Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2114301Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2114914Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2115530Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2116176Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2116311Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.2116354Z Autotune Choices Stats: 2025-12-04T09:58:55.2117114Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.2117349Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2117513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2117793Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2118445Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2119076Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2119730Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2120362Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2120990Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2121620Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2122257Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2122908Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2123554Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2124189Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2124317Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.2124392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2124435Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2124478Z unimplemented [] 2025-12-04T09:58:55.2124541Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2124641Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2125219Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2125258Z graph_break [] 2025-12-04T09:58:55.2125331Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2125386Z Autotune Choices Stats: 2025-12-04T09:58:55.2126170Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.2126298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2126415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2126576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2127212Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2127832Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2128448Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2129048Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2129655Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2130280Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2130879Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2131497Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2132114Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2132722Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2132849Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.2132890Z Autotune Choices Stats: 2025-12-04T09:58:55.2133654Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2133876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2134039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2134325Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2134953Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2135593Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2136279Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2136922Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2137556Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2138195Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2138825Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2139472Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2140112Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2140754Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2140893Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.2140970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2141012Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2141051Z unimplemented [] 2025-12-04T09:58:55.2141113Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2141219Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2141793Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2141835Z graph_break [] 2025-12-04T09:58:55.2141913Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2141954Z Autotune Choices Stats: 2025-12-04T09:58:55.2142700Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.2142839Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2142954Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2143113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2143721Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2144338Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2144953Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2145575Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2146216Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2146820Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2147458Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2148067Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2148687Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2149312Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2149452Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.2149494Z Autotune Choices Stats: 2025-12-04T09:58:55.2150255Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.2150476Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2150642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2150919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2151572Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2152201Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2152842Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2153479Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2154122Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2154752Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2155378Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2156076Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2156706Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2157358Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2157484Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.2157560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2157616Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2157669Z unimplemented [] 2025-12-04T09:58:55.2157729Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2157831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2158412Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2158447Z graph_break [] 2025-12-04T09:58:55.2158523Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2158563Z Autotune Choices Stats: 2025-12-04T09:58:55.2159310Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.2159437Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2159551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2159737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2160348Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2160972Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2161572Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2162187Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2162807Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2163410Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2164017Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2164629Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2165238Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2165851Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2166032Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.2166073Z Autotune Choices Stats: 2025-12-04T09:58:55.2166852Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.2167079Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2167248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2167530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.2168164Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2168815Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2169451Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2170095Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2170729Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2171367Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2171994Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2172623Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2173260Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2173890Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2174031Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices
2025-12-04T09:58:55.2174105Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:55.2174150Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:55.2174187Z unimplemented []
2025-12-04T09:58:55.2174248Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:55.2174346Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:55.2174929Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)]
2025-12-04T09:58:55.2174980Z graph_break []
2025-12-04T09:58:55.2175055Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:55.2175095Z Autotune Choices Stats:
2025-12-04T09:58:55.2175838Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0}
2025-12-04T09:58:55.2176045Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2176159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2176320Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2176934Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2177552Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2178174Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2178781Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2179395Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2180012Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2180625Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2181233Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2181847Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2182464Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2182597Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.2182638Z Autotune Choices Stats: 2025-12-04T09:58:55.2183413Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2183640Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2183803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2184079Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2184716Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2185347Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2186036Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2186680Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2187311Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2187953Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2188595Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2189224Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2189855Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2190498Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2190628Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.2190705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2190749Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2190787Z unimplemented [] 2025-12-04T09:58:55.2190849Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2190961Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2191533Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2191572Z graph_break [] 2025-12-04T09:58:55.2191645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2191700Z Autotune Choices Stats: 2025-12-04T09:58:55.2192448Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.2192576Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2192689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2192853Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2193461Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2194064Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2194690Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2195305Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2195912Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2196601Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2197206Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2197810Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2198415Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2199031Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2199160Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.2199202Z Autotune Choices Stats: 2025-12-04T09:58:55.2199990Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.2200206Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2200384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2200683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2201318Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2201944Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2202569Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2203214Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2203855Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2204483Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2205122Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2205763Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2206438Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2207068Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2207209Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.2207284Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2207328Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2207366Z unimplemented [] 2025-12-04T09:58:55.2207428Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2207527Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2208113Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2208153Z graph_break [] 2025-12-04T09:58:55.2208227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2208267Z Autotune Choices Stats: 2025-12-04T09:58:55.2209015Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.2209163Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2209279Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2209440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2210062Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2210668Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2211271Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2211885Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2212500Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2213117Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2213730Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2214338Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2214944Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2215550Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2215690Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.2215732Z Autotune Choices Stats: 2025-12-04T09:58:55.2216531Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.2216769Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2216934Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2217212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2217860Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2218499Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2219127Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2219755Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2220397Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2221042Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2221667Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2222320Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2222948Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2223576Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2223704Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.2223791Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.2223835Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2223872Z unimplemented [] 2025-12-04T09:58:55.2223932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2224033Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2224609Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2224648Z graph_break [] 2025-12-04T09:58:55.2224723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2224768Z Autotune Choices Stats: 2025-12-04T09:58:55.2225524Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.2225651Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2225767Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2225997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2226610Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2227217Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2227816Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.2228445Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2229048Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2229670Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2230282Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2230905Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2231512Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2232122Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2232249Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.2232300Z Autotune Choices Stats: 2025-12-04T09:58:55.2233061Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.2233280Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2233449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2233738Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2234380Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2235031Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2235654Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2236320Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2236950Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2237590Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2238233Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2238884Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2239525Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2240152Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2240280Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.2240354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2240395Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.2240434Z unimplemented [] 2025-12-04T09:58:55.2240493Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2240595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2241174Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2241224Z graph_break [] 2025-12-04T09:58:55.2241299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2241340Z Autotune Choices Stats: 2025-12-04T09:58:55.2242085Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.2242229Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2242345Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2242504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2243135Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2243749Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2244356Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2244958Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2245582Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2246229Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2246852Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2247468Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2248088Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2248698Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2248828Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.2248869Z Autotune Choices Stats: 2025-12-04T09:58:55.2249631Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.2249862Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2250031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2250309Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2250956Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2251591Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2252230Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2252858Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2253494Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2254134Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2254763Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2255401Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2256093Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2256734Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2256861Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.2256938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2256983Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2257020Z unimplemented [] 
2025-12-04T09:58:55.2257081Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2257182Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2257760Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2257798Z graph_break [] 2025-12-04T09:58:55.2257891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2257935Z Autotune Choices Stats: 2025-12-04T09:58:55.2258683Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.2258813Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2258932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2259095Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2259714Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2260327Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2260946Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2261552Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2262155Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2262775Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2263385Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2264000Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2264615Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2265231Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2265358Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.2265402Z Autotune Choices Stats: 2025-12-04T09:58:55.2266249Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.2266464Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2266654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2266935Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2267572Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2268216Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2268854Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2269494Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2270135Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2270766Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2271405Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2272032Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2275564Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2276282Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2276431Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.2276505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2276549Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2276585Z unimplemented [] 2025-12-04T09:58:55.2276647Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.2276746Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2277317Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2277357Z graph_break [] 2025-12-04T09:58:55.2277433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2277472Z Autotune Choices Stats: 2025-12-04T09:58:55.2278215Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.2278360Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2278472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2278634Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2279259Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2279872Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2280492Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2281105Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2281709Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2282315Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2282933Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2283536Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2284151Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2284765Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2284906Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.2284946Z Autotune Choices Stats: 
2025-12-04T09:58:55.2285703Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.2285989Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2286157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2286436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2287084Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2287712Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2288350Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2288988Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2289629Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2290257Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2290886Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2291531Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2292166Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2292791Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2292920Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.2293021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2293066Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2293102Z unimplemented [] 2025-12-04T09:58:55.2293165Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2293264Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2293839Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2293878Z graph_break [] 2025-12-04T09:58:55.2293953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2293995Z Autotune Choices Stats: 2025-12-04T09:58:55.2294742Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.2294871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2294996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2295164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.2295777Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2296447Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2297047Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2297663Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2298283Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2298889Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2299493Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2300117Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2300731Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2301337Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2301467Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.2301528Z Autotune Choices Stats: 2025-12-04T09:58:55.2302293Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.2302509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2302676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2302955Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2303592Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2304226Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2304867Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2305492Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2306195Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2306837Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2307463Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2308095Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8
2025-12-04T09:58:55.2308735Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:55.2309383Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4
2025-12-04T09:58:55.2309511Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices
2025-12-04T09:58:55.2309604Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T09:58:55.2309654Z Traceback (most recent call last):
2025-12-04T09:58:55.2309807Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T09:58:55.2309849Z self.assertTrue(
2025-12-04T09:58:55.2309954Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T09:58:55.2310018Z raise self.failureException(msg)
2025-12-04T09:58:55.2310150Z AssertionError: False is not true : Log file /tmp/tmpw_3v__bo/flex_attention_configs.json was not created
2025-12-04T09:58:55.2310156Z
2025-12-04T09:58:55.2310232Z To execute this test, run the following from the base repo dir:
2025-12-04T09:58:55.2310398Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T09:58:55.2310400Z
2025-12-04T09:58:55.2310491Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T09:58:55.2310566Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:55.2310608Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:55.2310645Z unimplemented []
2025-12-04T09:58:55.2310706Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:55.2311291Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)]
2025-12-04T09:58:55.2311392Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:55.2311431Z graph_break []
2025-12-04T09:58:55.2311504Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:55.2311998Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated.
It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.2312061Z current_size = base.storage().size() 2025-12-04T09:58:55.2312105Z Autotune Choices Stats: 2025-12-04T09:58:55.2312849Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.2312977Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2313096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2313266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2313875Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2314486Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2315099Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2315705Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2316354Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2316964Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2317569Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2318186Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2318799Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2319405Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2319535Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.2319578Z Autotune Choices Stats: 2025-12-04T09:58:55.2320345Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.2320563Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2320739Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2321013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2321648Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2322285Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2322915Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2323546Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2324176Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2324807Z 
triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2325447Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2326155Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2326799Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2327434Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2327575Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.2327651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2327693Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2327732Z unimplemented [] 2025-12-04T09:58:55.2327793Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2327894Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2328471Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2328508Z graph_break [] 2025-12-04T09:58:55.2328582Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2328622Z Autotune Choices Stats: 2025-12-04T09:58:55.2329368Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.2329508Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2329622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2329781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2330401Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2331006Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2331628Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2332230Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2332834Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2333436Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2334050Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2334660Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2335261Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2335881Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2336059Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.2336100Z Autotune Choices Stats: 2025-12-04T09:58:55.2336856Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.2337072Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2337236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2337512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2338165Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2338789Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2339425Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2340059Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2340709Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2341337Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2343895Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2344548Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2345183Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2345810Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2345995Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.2346112Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2346156Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2346194Z unimplemented [] 2025-12-04T09:58:55.2346260Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2346366Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2346943Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2346981Z graph_break [] 
2025-12-04T09:58:55.2347058Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2347099Z Autotune Choices Stats: 2025-12-04T09:58:55.2347837Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.2347964Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2348093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2348257Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2348866Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2349477Z triton_flex_attention_98 0.0106 ms 97.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2350077Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2350687Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2351309Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2351913Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2352514Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2353124Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2353736Z triton_flex_attention_95 
0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2354339Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2354471Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.2354523Z Autotune Choices Stats: 2025-12-04T09:58:55.2355288Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 
2025-12-04T09:58:55.2355508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2355678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2355983Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2356616Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2357256Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2357895Z 
triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2358519Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2359164Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2359810Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2360436Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2361061Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2361699Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2362329Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2362461Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.2362536Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2362579Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2362615Z unimplemented [] 2025-12-04T09:58:55.2362677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2362776Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2363360Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2363406Z graph_break [] 2025-12-04T09:58:55.2363483Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:55.2363523Z Autotune Choices Stats: 2025-12-04T09:58:55.2364264Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.2364393Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2364506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2364666Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2365276Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2365888Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2366540Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2367142Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2367768Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2368372Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2368971Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2369566Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2370178Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2370789Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2370919Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.2370959Z Autotune Choices Stats: 2025-12-04T09:58:55.2371730Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.2371958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2372122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2372397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2373033Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2373659Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2374293Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2374932Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2375564Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2376250Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2376879Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2377512Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2378138Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2378776Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2378905Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.2378982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2379025Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2379061Z unimplemented [] 2025-12-04T09:58:55.2379133Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2379232Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2379811Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2379850Z graph_break [] 2025-12-04T09:58:55.2379923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2379982Z Autotune Choices Stats: 2025-12-04T09:58:55.2380718Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.2380844Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2380960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2381122Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2381732Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2382337Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2382945Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2383557Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2384158Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.2384790Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2385387Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2386031Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2386634Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2387249Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2387377Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.2387418Z Autotune Choices Stats: 2025-12-04T09:58:55.2388189Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.2388407Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2388584Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2388874Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2389504Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2390129Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2390750Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2391396Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2392035Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2392661Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2393315Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2393938Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2394564Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2395187Z 
triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2395325Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.2395398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2395441Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2395477Z unimplemented [] 2025-12-04T09:58:55.2395539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2395637Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2396280Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2396319Z graph_break [] 2025-12-04T09:58:55.2396393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2396435Z Autotune Choices Stats: 2025-12-04T09:58:55.2397195Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.2397333Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2397446Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2397609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2398223Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2398830Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2399435Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2400055Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2400667Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2401277Z triton_flex_attention_250 0.0140 ms 68.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2401889Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2402493Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2403096Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2403695Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2403834Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.2403873Z Autotune Choices Stats: 2025-12-04T09:58:55.2404636Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.2404861Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2405025Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2405302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2406092Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2406718Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2407342Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2407968Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2408608Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2409250Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2409883Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2410524Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2411158Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2411779Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2411908Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.2411996Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2412038Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2412076Z unimplemented [] 2025-12-04T09:58:55.2412136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2412237Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2412812Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2412849Z graph_break [] 2025-12-04T09:58:55.2412923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2412965Z Autotune Choices Stats: 2025-12-04T09:58:55.2413714Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.2413840Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2413967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2414137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2414746Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2415353Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2415978Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2416588Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2417190Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2417815Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2418429Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2419041Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2419637Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2420241Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2420370Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.2420422Z Autotune Choices Stats: 2025-12-04T09:58:55.2421174Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.2421392Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2421558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2421847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2422498Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2423132Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2423755Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2424385Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2425012Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2425650Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2426333Z 
triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2426975Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2427614Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2428245Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2428376Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.2428456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2428498Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2428536Z unimplemented [] 2025-12-04T09:58:55.2428596Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2428696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2429265Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2429316Z graph_break [] 2025-12-04T09:58:55.2429389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2429430Z Autotune Choices Stats: 2025-12-04T09:58:55.2430183Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.2430310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2430426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2430587Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2431217Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2431830Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2432432Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2433041Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2433665Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2434270Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2434880Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2435489Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2436128Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.2436730Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2436860Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.2436901Z Autotune Choices Stats: 2025-12-04T09:58:55.2437655Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2437888Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2438053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 
1, 1, 1] 2025-12-04T09:58:55.2438330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2438986Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2439618Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2440250Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2440874Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2441502Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2442143Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2442765Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2443403Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2444037Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2444675Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2444802Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.2444877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2444921Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2444959Z unimplemented [] 2025-12-04T09:58:55.2445019Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2445120Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2445695Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2445732Z graph_break [] 2025-12-04T09:58:55.2445816Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2445857Z Autotune Choices Stats: 2025-12-04T09:58:55.2446661Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.2446787Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2446904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2447065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2447689Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2448309Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2448927Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2449531Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2450141Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2450758Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2451364Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2451974Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2452583Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2453200Z triton_flex_attention_378 0.0168 ms 61.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2453327Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.2453369Z Autotune Choices Stats: 2025-12-04T09:58:55.2454131Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.2454346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2454528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2454809Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2455446Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2456120Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2456757Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2457386Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2458017Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2458642Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2459286Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2459915Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2460554Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2461187Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2461326Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:55.2461403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2461445Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2461483Z unimplemented [] 2025-12-04T09:58:55.2461544Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2461643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2462228Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2462265Z graph_break [] 2025-12-04T09:58:55.2462339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2462378Z Autotune Choices Stats: 2025-12-04T09:58:55.2463119Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.2463255Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2463368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2463528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2464151Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2464757Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2465368Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2466015Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2466622Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2467226Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2467842Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2468446Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2469062Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2469677Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2469832Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.2469873Z Autotune Choices Stats: 2025-12-04T09:58:55.2470638Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.2470856Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2471023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2471303Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2471946Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2472571Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2473213Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2473852Z triton_flex_attention_backward_447 0.0186 ms 83.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2474488Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2475117Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2475742Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2476434Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2477049Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2477697Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2477827Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.2477905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2477974Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2478012Z unimplemented [] 2025-12-04T09:58:55.2478073Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2478173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2478752Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2478789Z graph_break [] 2025-12-04T09:58:55.2478864Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2478905Z Autotune Choices Stats: 2025-12-04T09:58:55.2479652Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.2479781Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2479910Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2480073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2480684Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2481295Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2481900Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2482516Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2483125Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2483739Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.2484353Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2484970Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2485586Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2486232Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2486364Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.2486405Z Autotune Choices Stats: 2025-12-04T09:58:55.2487205Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.2487424Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2487590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2487869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.2488505Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2489151Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2489777Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2490417Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2491062Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2491701Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2492327Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2492950Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2493594Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2494228Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2494359Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.2494433Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2494477Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2494513Z unimplemented [] 2025-12-04T09:58:55.2494574Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2494675Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2495262Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2495312Z graph_break [] 2025-12-04T09:58:55.2495386Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2495428Z Autotune Choices Stats: 2025-12-04T09:58:55.2496206Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 
2025-12-04T09:58:55.2496338Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2496452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2496617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2497235Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2497850Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2498469Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2499068Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2499684Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2500298Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2500906Z triton_flex_attention_520 0.0136 ms 69.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2501513Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2502121Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2502743Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2502874Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.2502913Z Autotune Choices Stats: 2025-12-04T09:58:55.2503692Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.2503922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2504086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2504368Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2505004Z triton_flex_attention_backward_547 0.0159 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2505629Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2506306Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2506947Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2507578Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2508218Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2508859Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2509492Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2510123Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2510759Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2510887Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 
choices 2025-12-04T09:58:55.2510962Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2511009Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2511047Z unimplemented [] 2025-12-04T09:58:55.2511108Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2511220Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2511800Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2511843Z graph_break [] 2025-12-04T09:58:55.2511918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2511970Z Autotune Choices Stats: 2025-12-04T09:58:55.2512719Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.2512850Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.2512967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2513129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2513742Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2514344Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2514958Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2515580Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2516248Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2516873Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2517478Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2518086Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2518691Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2519313Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2519444Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:55.2519485Z Autotune Choices Stats: 2025-12-04T09:58:55.2520260Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.2520481Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2520658Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2520949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2521584Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2522213Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2522841Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2523483Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2524125Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2524757Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2525396Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2526082Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2526712Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2527345Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2527488Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.2527562Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.2527604Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2527642Z unimplemented [] 2025-12-04T09:58:55.2527703Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2527803Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2528387Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2528427Z graph_break [] 2025-12-04T09:58:55.2528500Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2528540Z Autotune Choices Stats: 2025-12-04T09:58:55.2529294Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.2529437Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2529553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2529715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2530327Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2530932Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2531542Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2532161Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2532778Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2533394Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2534989Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2535601Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2536829Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2537436Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2537583Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.2537624Z Autotune Choices Stats: 2025-12-04T09:58:55.2538382Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.2538604Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2538770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2539049Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2539687Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2540368Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2540998Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2541637Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2542280Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2542907Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2543534Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2544173Z triton_flex_attention_backward_643 0.0220 ms 70.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2544816Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2545461Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2545593Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.2545681Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2545725Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.2545763Z unimplemented [] 2025-12-04T09:58:55.2545825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2545959Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2546546Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2546582Z graph_break [] 2025-12-04T09:58:55.2546658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2546701Z Autotune Choices Stats: 2025-12-04T09:58:55.2547445Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.2547572Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2547686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2547861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2548487Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2549093Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2549717Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2550321Z triton_flex_attention_650 0.0128 ms 77.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2550947Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2551556Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2552159Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2552783Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2553394Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2554011Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2554141Z SingleProcess AUTOTUNE 
benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.2554191Z Autotune Choices Stats: 2025-12-04T09:58:55.2554946Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.2555164Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2555330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2555611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2556308Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2556945Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2557589Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2558231Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2558859Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2559497Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2560121Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2560746Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2561393Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2562017Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2562147Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.2562222Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2562265Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2562304Z unimplemented [] 
2025-12-04T09:58:55.2562367Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2562477Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2563051Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2563099Z graph_break [] 2025-12-04T09:58:55.2563173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2563215Z Autotune Choices Stats: 2025-12-04T09:58:55.2563947Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.2564075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2564188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2564351Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2564962Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2565589Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2566234Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2566862Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2567473Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2568077Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2568683Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2569296Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2569920Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2570521Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2570655Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.2570694Z Autotune Choices Stats: 2025-12-04T09:58:55.2571467Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.2571692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2571859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2572138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2572769Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2573398Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2574048Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2574671Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2575313Z 
triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2575988Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2576628Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2577257Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2577891Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2578540Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2578670Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.2578748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2578791Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2578830Z unimplemented [] 2025-12-04T09:58:55.2578889Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.2578988Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2579581Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2579619Z graph_break [] 2025-12-04T09:58:55.2579691Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2579743Z Autotune Choices Stats: 2025-12-04T09:58:55.2580487Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.2580615Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2580732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2580894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2581509Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2582112Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2582747Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2583348Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2583964Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2584580Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2585185Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2585789Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2586436Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2587070Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2587199Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.2587242Z Autotune Choices Stats: 
2025-12-04T09:58:55.2588003Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.2588236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2588414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2588691Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2589320Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2589944Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2590565Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2591216Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2591844Z triton_flex_attention_backward_779 0.0199 ms 78.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2592485Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2593119Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2593749Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2594376Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2595001Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2595138Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.2595212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2595254Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2595293Z unimplemented [] 2025-12-04T09:58:55.2595364Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2595466Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2596101Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2596139Z graph_break [] 2025-12-04T09:58:55.2596213Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2596253Z Autotune Choices Stats: 2025-12-04T09:58:55.2597014Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.2597152Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2597266Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2597425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.2598036Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2598643Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2599243Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2599878Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2600482Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2601094Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2601706Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2602313Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2602914Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2603517Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2603658Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.2603700Z Autotune Choices Stats: 2025-12-04T09:58:55.2604464Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.2604683Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2604847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2605129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2605768Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2606427Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2607052Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2607672Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2608340Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2608970Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2609612Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2610255Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2610880Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2611513Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2611641Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.2611717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2611769Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2611807Z unimplemented [] 2025-12-04T09:58:55.2611868Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2611967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2612550Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2612591Z graph_break [] 2025-12-04T09:58:55.2612669Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2612712Z Autotune Choices Stats: 2025-12-04T09:58:55.2613457Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.2613592Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2613710Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2613892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2614506Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2615206Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2615818Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2616460Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2617092Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2617692Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2618307Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2618920Z triton_flex_attention_848 0.0144 ms 60.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2619516Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2620120Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2620249Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.2620292Z Autotune Choices Stats: 2025-12-04T09:58:55.2621067Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.2621284Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2621449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2621728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2622381Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2623016Z 
triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2623643Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2624265Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2624899Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2625546Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2626209Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2626849Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2627492Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2628116Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2628244Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.2628320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2628363Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2628400Z unimplemented [] 2025-12-04T09:58:55.2628461Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2628560Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2629137Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2629187Z graph_break [] 2025-12-04T09:58:55.2629263Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2629302Z Autotune Choices Stats: 2025-12-04T09:58:55.2630050Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.2630180Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2630294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2630456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2631085Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2631706Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2632307Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2632910Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2633513Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2634140Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2634737Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2635345Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2636000Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2636604Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2636733Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.2636772Z Autotune Choices Stats: 2025-12-04T09:58:55.2637530Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.2637761Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2637944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2638224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2638854Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2639492Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2640135Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2640756Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2641381Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2642009Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2642657Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2643285Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2643920Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2644555Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2644685Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.2644758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2644803Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2644841Z unimplemented [] 2025-12-04T09:58:55.2644902Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2645000Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2645578Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2645614Z graph_break [] 2025-12-04T09:58:55.2645687Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2645727Z Autotune Choices Stats: 2025-12-04T09:58:55.2646553Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.2646682Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2646795Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2646956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2647567Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2648173Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2648794Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2649399Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2650001Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2650607Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2651224Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2651829Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2652443Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2653052Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2653181Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.2653221Z Autotune Choices Stats: 2025-12-04T09:58:55.2653981Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.2654197Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2654360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2654647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2655285Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2655913Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2656593Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2657231Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2657864Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2658492Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2659118Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2659767Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2660397Z 
triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2661031Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2661168Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.2661242Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2661284Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2661320Z unimplemented [] 2025-12-04T09:58:55.2661383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2661481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2662058Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2662097Z graph_break [] 2025-12-04T09:58:55.2662171Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2662213Z Autotune Choices Stats: 2025-12-04T09:58:55.2662953Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.2663096Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2663209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2663376Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2663990Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2664604Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2665206Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2665820Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.2666489Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2667095Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2667716Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2668334Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2668938Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2669557Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2669700Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.2669741Z Autotune Choices Stats: 2025-12-04T09:58:55.2670502Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.2670720Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2670885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2671166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2671808Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2672448Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2673074Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2673704Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2674344Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2674971Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2675602Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2676285Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2676927Z triton_flex_attention_backward_1002 0.0228 ms 68.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2677551Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2677694Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.2677768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2677830Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2677866Z unimplemented [] 2025-12-04T09:58:55.2677930Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2678029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2678611Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2678648Z graph_break [] 2025-12-04T09:58:55.2678720Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2678761Z Autotune Choices Stats: 2025-12-04T09:58:55.2679498Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.2679624Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2679738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2679919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2680552Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2681152Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2681771Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2682380Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2682996Z triton_flex_attention_1016 0.0132 ms 70.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2683600Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2684213Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2684843Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2685447Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2686105Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2686235Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.2686288Z Autotune Choices Stats: 2025-12-04T09:58:55.2687047Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.2687264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2687428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2687709Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2688342Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2688984Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2689623Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2690262Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2690890Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2691533Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2692157Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2692783Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2693430Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2694059Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2694189Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.2694264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2694306Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2694343Z unimplemented [] 2025-12-04T09:58:55.2694404Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2694520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2695087Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2695135Z graph_break [] 2025-12-04T09:58:55.2695209Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2695251Z Autotune Choices Stats: 2025-12-04T09:58:55.2696047Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.2696174Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2696289Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2696450Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2697058Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2697683Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2698290Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2698906Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2699527Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2700134Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2700736Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2701339Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2701955Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2702565Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2702694Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.2702735Z Autotune Choices Stats: 2025-12-04T09:58:55.2703502Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.2703729Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2703894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2704171Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2704812Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2705442Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2706140Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2706771Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2707424Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2708051Z 
triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2708686Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2709318Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2709948Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2710603Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2710732Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.2710810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2710852Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2710889Z unimplemented [] 2025-12-04T09:58:55.2710949Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2711051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2711638Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2711676Z graph_break [] 2025-12-04T09:58:55.2711749Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2711801Z Autotune Choices Stats: 2025-12-04T09:58:55.2712538Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.2712665Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2712779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2712939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2713558Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2714164Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2714787Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2715389Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2716134Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2716748Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2717350Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2717957Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2718564Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2719191Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2719319Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.2719361Z Autotune Choices Stats: 2025-12-04T09:58:55.2720125Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.2720354Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2720528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2720802Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2721434Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2722060Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2722688Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2723332Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2723960Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2724604Z triton_flex_attention_backward_1146 0.0200 ms 79.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2725238Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2725866Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2726542Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2727170Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2727327Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.2727403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2727446Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2727503Z unimplemented [] 2025-12-04T09:58:55.2727563Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2727667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2728246Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 
2025-12-04T09:58:55.2728283Z graph_break [] 2025-12-04T09:58:55.2728359Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2728398Z Autotune Choices Stats: 2025-12-04T09:58:55.2729160Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.2729301Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2729415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2729579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2730185Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2730787Z 
triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2731392Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2732020Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2732625Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2733237Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2733855Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2734456Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2735060Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2735661Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2735799Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.2735839Z Autotune Choices Stats: 2025-12-04T09:58:55.2736661Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", 
"best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.2736879Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2737045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2737334Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2737980Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2738609Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2739237Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2739861Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2740526Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2741155Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2741790Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2742447Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2743079Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2743711Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2743841Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.2743917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2743971Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2744007Z unimplemented [] 2025-12-04T09:58:55.2744069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2744167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2744755Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2744791Z graph_break [] 2025-12-04T09:58:55.2744868Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2744909Z Autotune Choices Stats: 2025-12-04T09:58:55.2745667Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.2745795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2745917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2746117Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2746730Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2747347Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2747960Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2748570Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2749210Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2749813Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2750430Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2751047Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2751652Z triton_flex_attention_1208 0.0151 ms 66.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2752263Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2752393Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.2752447Z Autotune Choices Stats: 2025-12-04T09:58:55.2753220Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 
2025-12-04T09:58:55.2753435Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2753603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2753883Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2754532Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2755173Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2755802Z 
triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2756494Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2757123Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2757792Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2758420Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2759077Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2759719Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2760349Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2760479Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.2760553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2760596Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2760636Z unimplemented [] 2025-12-04T09:58:55.2760697Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2760797Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2761373Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2761423Z graph_break [] 2025-12-04T09:58:55.2761504Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:55.2761543Z Autotune Choices Stats: 2025-12-04T09:58:55.2762299Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.2762431Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2762545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2762707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2763334Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2763946Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2764553Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2765159Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2765774Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2766436Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2767041Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2767663Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2768279Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2768882Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2769011Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.2769052Z Autotune Choices Stats: 2025-12-04T09:58:55.2769813Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2770045Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2770218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2770493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2771129Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2771766Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2772405Z triton_flex_attention_backward_1274 0.0186 
ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2773024Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2773648Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2774279Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2774934Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2775565Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2776258Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2776895Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2777025Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.2777101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2777144Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2777181Z unimplemented [] 2025-12-04T09:58:55.2777245Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2777345Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2777925Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2777964Z graph_break [] 2025-12-04T09:58:55.2778037Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2778101Z Autotune Choices Stats: 
2025-12-04T09:58:55.2778855Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.2778985Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2779098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2779263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2779881Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2780484Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2781102Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2781705Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2782309Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2782923Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2783539Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2784157Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2784758Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2785374Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2785502Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.2785543Z Autotune Choices Stats: 2025-12-04T09:58:55.2786349Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2786564Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2786747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2787026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2787676Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2788303Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2788938Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2789581Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2790213Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2790847Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2791495Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2792137Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2792778Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2793401Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2793542Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.2793620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2793662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2793704Z unimplemented [] 2025-12-04T09:58:55.2793766Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2793867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2794450Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2794491Z graph_break [] 2025-12-04T09:58:55.2794565Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2794607Z Autotune Choices Stats: 2025-12-04T09:58:55.2795350Z {"num_choices": 24, "num_triton_choices": 
24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.2795491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2795606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2795776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2796434Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2797058Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2797672Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2798276Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2798886Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2799489Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2800130Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2800731Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2801347Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2801947Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2802095Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.2802137Z Autotune Choices Stats: 2025-12-04T09:58:55.2802898Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.2803123Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.2803294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2803570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2804215Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2804855Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2805497Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2806163Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2806816Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2807446Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2808075Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2808730Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2809358Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2810002Z 
triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2810132Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.2810219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2810261Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2810300Z unimplemented [] 2025-12-04T09:58:55.2810361Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2810461Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2811034Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2811073Z graph_break [] 2025-12-04T09:58:55.2811146Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2811186Z Autotune Choices Stats: 2025-12-04T09:58:55.2811925Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.2812052Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2812169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2812337Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2812969Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2813574Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2814192Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2814809Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2815411Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2816079Z triton_flex_attention_1385 0.0134 ms 69.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2816682Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2817323Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2817925Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2818541Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2818682Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.2818725Z Autotune Choices Stats: 2025-12-04T09:58:55.2819482Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2819699Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2819865Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2820142Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2820784Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2821438Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2822056Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2822827Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2823464Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2824106Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2824728Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2825355Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2826060Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2826687Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2826815Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.2826890Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2826934Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2826975Z unimplemented [] 2025-12-04T09:58:55.2827048Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2827150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2827742Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2827780Z graph_break [] 2025-12-04T09:58:55.2827857Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2827900Z Autotune Choices Stats: 2025-12-04T09:58:55.2828646Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.2828771Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2828886Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2829048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2829655Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2830285Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2830891Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2831503Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2832118Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2832726Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2833335Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2833940Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2834571Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2835176Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2835307Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.2835350Z Autotune Choices Stats: 2025-12-04T09:58:55.2836162Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.2836392Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2836561Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2836840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2837476Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2838102Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2838779Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2839405Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2840050Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2840687Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2841317Z 
triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2841947Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2842574Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2843222Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2843354Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.2843430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2843475Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2843511Z unimplemented [] 2025-12-04T09:58:55.2843571Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2843670Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2844253Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2844302Z graph_break [] 2025-12-04T09:58:55.2844381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2844422Z Autotune Choices Stats: 2025-12-04T09:58:55.2845167Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.2845294Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2845410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2845572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2846235Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2846841Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.2847485Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2848094Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2848710Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2849329Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2853597Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2854224Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2854827Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2855465Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2855600Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.2855642Z Autotune Choices Stats: 2025-12-04T09:58:55.2856483Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.2856703Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2856887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2857166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2857802Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2858433Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2859064Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2859726Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2860353Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2860995Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2861631Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2862259Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2862887Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2863511Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2863654Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.2863741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2863785Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2863824Z unimplemented [] 2025-12-04T09:58:55.2863886Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2863989Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2864563Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2864600Z graph_break [] 2025-12-04T09:58:55.2864678Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2864719Z Autotune Choices Stats: 2025-12-04T09:58:55.2865476Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 2025-12-04T09:58:55.2865617Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2865733Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2865896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2866534Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2867143Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2867767Z triton_flex_attention_1523 0.0116 ms 77.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2868382Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2868982Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2869603Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2870224Z triton_flex_attention_1540 0.0131 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2870825Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2871429Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2872036Z 
triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2872175Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.2872224Z Autotune Choices Stats: 2025-12-04T09:58:55.2872978Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.2873197Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2873374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2873649Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2874303Z triton_flex_attention_backward_1559 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2874928Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2875558Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2876225Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2876880Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2877509Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2878150Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2878786Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2879410Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2880039Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2880168Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 seconds precompiling for 22 choices 2025-12-04T09:58:55.2880253Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2880299Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2880336Z unimplemented [] 2025-12-04T09:58:55.2880398Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2880499Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2881089Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2881128Z graph_break [] 2025-12-04T09:58:55.2881201Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2881241Z Autotune Choices Stats: 2025-12-04T09:58:55.2881986Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.2882113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2882237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2882397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2883012Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2883618Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2884222Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2884838Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2885461Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2886096Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2886719Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2887337Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2887940Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2888546Z triton_flex_attention_1576 0.0147 ms 60.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2888688Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.2888726Z Autotune Choices Stats: 2025-12-04T09:58:55.2889510Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.2889729Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2889894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2890170Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2890817Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2891455Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2892082Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2892715Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2893347Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2893997Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2894623Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2895261Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2895900Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2896580Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2896710Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.2896785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2896830Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2896868Z unimplemented [] 2025-12-04T09:58:55.2896932Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2897031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2897623Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2897660Z graph_break [] 2025-12-04T09:58:55.2897746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2897787Z Autotune Choices Stats: 2025-12-04T09:58:55.2898533Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.2898661Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2898776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2898949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2899559Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2900175Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2900781Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2901384Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2902014Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2902616Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2903235Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2903837Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2904453Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2905055Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2905185Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.2905227Z Autotune Choices Stats: 2025-12-04T09:58:55.2906023Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.2906255Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2906442Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2906716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2907351Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2908087Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2908725Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2909354Z triton_flex_attention_backward_1643 0.0187 ms 84.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2909992Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2910628Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2911266Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2911908Z triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2912539Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2913174Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2913303Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.2913379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2913420Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2913458Z unimplemented [] 2025-12-04T09:58:55.2913519Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2913617Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2914197Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2914246Z graph_break [] 2025-12-04T09:58:55.2914319Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2914359Z Autotune Choices Stats: 2025-12-04T09:58:55.2915106Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.2915233Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2915349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2915505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2916167Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2916786Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2917391Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2917997Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2918603Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2919232Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.2919832Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2920446Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2921051Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2921664Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2921794Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.2921837Z Autotune Choices Stats: 2025-12-04T09:58:55.2922598Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.2922814Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2922998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2923284Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.2923919Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2924562Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2925189Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2925828Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2926492Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2927114Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2927760Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2928392Z triton_flex_attention_backward_1701 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2929034Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2929675Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2929803Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.2929878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2929920Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2929957Z unimplemented [] 2025-12-04T09:58:55.2930017Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2930119Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2930692Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2930730Z graph_break [] 2025-12-04T09:58:55.2930806Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2930845Z Autotune Choices Stats: 2025-12-04T09:58:55.2931587Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 
2025-12-04T09:58:55.2931724Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2931850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2932012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2932629Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2933249Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2933861Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2934470Z triton_flex_attention_1705 0.0130 ms 80.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2935078Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2935679Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2936343Z triton_flex_attention_1716 0.0142 ms 74.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2936950Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2937568Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2938192Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2938321Z SingleProcess AUTOTUNE benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.2938361Z Autotune Choices Stats: 2025-12-04T09:58:55.2939129Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.2939345Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2939513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2939790Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2940450Z triton_flex_attention_backward_1743 0.0154 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2941083Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2941723Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2942356Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2942988Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2943619Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2944242Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2944889Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2945520Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2946201Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2946339Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 
22 choices 2025-12-04T09:58:55.2946413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2946455Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2946492Z unimplemented [] 2025-12-04T09:58:55.2946552Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2946652Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2947229Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.2947266Z graph_break [] 2025-12-04T09:58:55.2947339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2947378Z Autotune Choices Stats: 2025-12-04T09:58:55.2948125Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.2948250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2948378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2948537Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2949161Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2949764Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2950379Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2950989Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2951593Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2952207Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2952810Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2953441Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2954046Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2954668Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2954806Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds precompiling for 24 choices 2025-12-04T09:58:55.2954845Z Autotune Choices Stats: 2025-12-04T09:58:55.2955610Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.2955826Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2956046Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2956327Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2956960Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2957623Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2958240Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2958880Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2959523Z triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2960152Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2960778Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2961407Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2962054Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2962678Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2962807Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.2962880Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2962935Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2962972Z unimplemented [] 2025-12-04T09:58:55.2963033Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2963142Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2963717Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2963752Z graph_break [] 2025-12-04T09:58:55.2963827Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2963867Z Autotune Choices Stats: 2025-12-04T09:58:55.2964618Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.2964748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.2964862Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2965024Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2965641Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2966324Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2966931Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2967549Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2968166Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2968772Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2969377Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2969983Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2970616Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2971219Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2971349Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.2971389Z Autotune Choices Stats: 2025-12-04T09:58:55.2972168Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.2972395Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2972560Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2972837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2973470Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2974096Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2974737Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2975363Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2976046Z triton_flex_attention_backward_1837 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2976685Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2977311Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.2977942Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2978567Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2979220Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2979349Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.2979422Z ----------------------------- Captured stdout 
call ----------------------------- 2025-12-04T09:58:55.2979466Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.2979502Z unimplemented [] 2025-12-04T09:58:55.2979562Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2979662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2980243Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2980292Z graph_break [] 2025-12-04T09:58:55.2980367Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2980406Z Autotune Choices Stats: 2025-12-04T09:58:55.2981153Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.2981282Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2981396Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2981558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2982172Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2982787Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2983402Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.2984006Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2984616Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2985233Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2985835Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2986476Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2987078Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2987722Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.2987855Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.2987895Z Autotune Choices Stats: 2025-12-04T09:58:55.2988667Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.2988894Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2989057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2989334Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2989974Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2990603Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2991228Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2991878Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2992514Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2993150Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2993786Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2994418Z triton_flex_attention_backward_1885 0.0220 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2995049Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2995675Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2995827Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.2995901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.2995983Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.2996020Z unimplemented [] 2025-12-04T09:58:55.2996080Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.2996180Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.2996759Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.2996800Z graph_break [] 2025-12-04T09:58:55.2996874Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.2996927Z Autotune Choices Stats: 2025-12-04T09:58:55.2997669Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.2997811Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.2997926Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.2998090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.2998710Z triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.2999316Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.2999946Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3000544Z triton_flex_attention_1890 0.0127 ms 78.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3001167Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3001767Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3002388Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3002986Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3003593Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3004206Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3004345Z SingleProcess AUTOTUNE 
benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.3004385Z Autotune Choices Stats: 2025-12-04T09:58:55.3005147Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.3005368Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3005547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3005836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3006493Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3007118Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3007747Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3008381Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3009019Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3009645Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3010285Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3010929Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3011554Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3012260Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3012402Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.3012495Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.3012543Z Traceback (most recent call last): 2025-12-04T09:58:55.3012697Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.3012736Z self.assertTrue( 2025-12-04T09:58:55.3012856Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.3012906Z raise self.failureException(msg) 2025-12-04T09:58:55.3013033Z AssertionError: False is not true : Log file /tmp/tmpvi278rjz/flex_attention_configs.json was not created 2025-12-04T09:58:55.3013037Z 2025-12-04T09:58:55.3013112Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.3013279Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.3013281Z 2025-12-04T09:58:55.3013372Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.3013450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3013491Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3013529Z unimplemented [] 2025-12-04T09:58:55.3013589Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3014179Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.3014294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3014329Z graph_break [] 2025-12-04T09:58:55.3014404Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:55.3014893Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.3014944Z current_size = base.storage().size() 2025-12-04T09:58:55.3014984Z Autotune Choices Stats: 2025-12-04T09:58:55.3015728Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.3015858Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3016015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3016175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3016797Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3017411Z triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3018014Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3018636Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.3019256Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3019855Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3020461Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3021059Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3021685Z triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3022285Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3022415Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.3022469Z Autotune Choices Stats: 2025-12-04T09:58:55.3023219Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.3023451Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3023618Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3023897Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3024527Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3025150Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3025787Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3026435Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3027076Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3027717Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3028339Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3028966Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3029592Z triton_flex_attention_backward_36 0.0229 ms 68.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3030245Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3030375Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.3030449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3030493Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3030530Z unimplemented [] 2025-12-04T09:58:55.3030593Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3030691Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3031271Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3031318Z graph_break [] 2025-12-04T09:58:55.3031393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3031432Z Autotune Choices Stats: 2025-12-04T09:58:55.3032174Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.3032305Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3032418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3032577Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3033186Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3033799Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3034418Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3035022Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3035632Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3036277Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3036879Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3037480Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3038094Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3038705Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3038835Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.3038874Z Autotune Choices Stats: 2025-12-04T09:58:55.3039650Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.3039878Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3040043Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3040321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3040949Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3041578Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3042197Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3042836Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3043466Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3044103Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3044739Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3045364Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3046059Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3046686Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3046842Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.3046916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3046958Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3046994Z unimplemented [] 2025-12-04T09:58:55.3047055Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3047156Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3047735Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3047772Z graph_break [] 2025-12-04T09:58:55.3047860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3047901Z Autotune Choices Stats: 2025-12-04T09:58:55.3048649Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.3048791Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3048905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3049068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3049689Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3050302Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3050930Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3051528Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3052146Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3052750Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3053363Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3053956Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3054560Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3055173Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3055312Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.3055353Z Autotune Choices Stats: 2025-12-04T09:58:55.3056160Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.3056379Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3056560Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3056855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3057489Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3058121Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3058749Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3059381Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3060036Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3060664Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3061299Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3061940Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3062564Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3063188Z triton_flex_attention_backward_119 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3063330Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.3063405Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3063449Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3063486Z unimplemented [] 2025-12-04T09:58:55.3063546Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3063645Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3064232Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3064272Z graph_break [] 
2025-12-04T09:58:55.3064344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3064385Z Autotune Choices Stats: 2025-12-04T09:58:55.3065135Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.3065279Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3065391Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3065552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3066201Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3066804Z triton_flex_attention_142 0.0110 ms 82.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3067407Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3068041Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3068643Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3069258Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3069861Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3070476Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3071074Z triton_flex_attention_150 
0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3071677Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3071819Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.3071857Z Autotune Choices Stats: 2025-12-04T09:58:55.3072636Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 
2025-12-04T09:58:55.3072852Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3073018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3073294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3073932Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3074570Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3075193Z 
triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3075817Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3076491Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3077136Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3077776Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3078403Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3079045Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3079670Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3079801Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.3079876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3079917Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3079954Z unimplemented [] 2025-12-04T09:58:55.3080014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3080114Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3080701Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3080740Z graph_break [] 2025-12-04T09:58:55.3080823Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T09:58:55.3080864Z Autotune Choices Stats: 2025-12-04T09:58:55.3081605Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.3081736Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3081867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3082027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3082647Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3083253Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3083861Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3084466Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3085085Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3085694Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3086353Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3086956Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3087565Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3088170Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3088299Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.3088341Z Autotune Choices Stats: 2025-12-04T09:58:55.3089104Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.3089365Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3089531Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3089808Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3090460Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3091083Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3091713Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3092342Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3092971Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3093621Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3094247Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3094885Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3095512Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3096185Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3096314Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.3096389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3096431Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3096468Z unimplemented [] 2025-12-04T09:58:55.3096528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3096626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3097205Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3097258Z graph_break [] 2025-12-04T09:58:55.3097332Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3097375Z Autotune Choices Stats: 2025-12-04T09:58:55.3098134Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.3098261Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3098374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3098533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3099160Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3099776Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3100380Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3100988Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3101603Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3102226Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3102823Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3103439Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3104050Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3104654Z triton_flex_attention_232 0.0167 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3104782Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.3104824Z Autotune Choices Stats: 2025-12-04T09:58:55.3105583Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.3105804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3106018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3106311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3106939Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3107580Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3108206Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3108834Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3109460Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3110091Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3110737Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3111363Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3111991Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3112630Z 
triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3112761Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.3112838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3112884Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3112922Z unimplemented [] 2025-12-04T09:58:55.3112981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3113082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3113655Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3113694Z graph_break [] 2025-12-04T09:58:55.3113770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3113809Z Autotune Choices Stats: 2025-12-04T09:58:55.3114551Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.3114693Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3114818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3114978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3115592Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3116239Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3116856Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3117456Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3118053Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3118651Z triton_flex_attention_298 0.0134 ms 87.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3119280Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3119883Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3120496Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3121113Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3121244Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.3121285Z Autotune Choices Stats: 2025-12-04T09:58:55.3122042Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.3122263Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3122428Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3122704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3123366Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3123991Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3124627Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3125266Z triton_flex_attention_backward_309 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3125898Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3126560Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3127186Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3127852Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3128473Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3129115Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3129265Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.3129341Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3129384Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3129423Z unimplemented [] 2025-12-04T09:58:55.3129482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3129582Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3130166Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3130205Z graph_break [] 2025-12-04T09:58:55.3130281Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3130319Z Autotune Choices Stats: 2025-12-04T09:58:55.3131056Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.3131183Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3131310Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3131472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3132090Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3132699Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3133312Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3133924Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3134533Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3135137Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3135737Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3136413Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3137014Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3137632Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3137775Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.3137815Z Autotune Choices Stats: 2025-12-04T09:58:55.3138573Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.3138791Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3138962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3139243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3139873Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3140527Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3141154Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3141791Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3142428Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3143053Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3143682Z 
triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3144308Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3144954Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3145581Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3145715Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.3145789Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3145850Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3145889Z unimplemented [] 2025-12-04T09:58:55.3145987Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3146100Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3146679Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3146716Z graph_break [] 2025-12-04T09:58:55.3146796Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3146837Z Autotune Choices Stats: 2025-12-04T09:58:55.3147572Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.3147703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3147817Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3147980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3148592Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3149229Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3149832Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3150451Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3151067Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3151670Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3152273Z triton_flex_attention_388 0.0140 ms 73.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3152877Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3153502Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3154104Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3154235Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.3154274Z Autotune Choices Stats: 2025-12-04T09:58:55.3155044Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.3155271Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3155439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3155716Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3156383Z triton_flex_attention_backward_409 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3157012Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3157665Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3158290Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3158927Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3159564Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3160188Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3160813Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3161439Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3162096Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3162225Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 choices 2025-12-04T09:58:55.3162298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3162342Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3162379Z unimplemented [] 2025-12-04T09:58:55.3162439Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3162537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3163124Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3163172Z graph_break [] 2025-12-04T09:58:55.3163247Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3163287Z Autotune Choices Stats: 2025-12-04T09:58:55.3164036Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.3164163Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3164278Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3164441Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3165050Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3165665Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3166326Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3166933Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3167545Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3168166Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3168773Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3169378Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3169979Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3170607Z triton_flex_attention_432 0.0162 ms 57.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3170739Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.3170779Z Autotune Choices Stats: 2025-12-04T09:58:55.3171548Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.3171774Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3171937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3172215Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3172844Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3173478Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3174105Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3174753Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3175383Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3176063Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3176700Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3177327Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3177955Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3178582Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3178731Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.3178818Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3178862Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3178899Z unimplemented [] 2025-12-04T09:58:55.3178964Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3179063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3179635Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3179673Z graph_break [] 2025-12-04T09:58:55.3179747Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3179800Z Autotune Choices Stats: 2025-12-04T09:58:55.3180538Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.3180676Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3180790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3180953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3181567Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3182179Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3182789Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3183399Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3184002Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3184618Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3185243Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3185844Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3186491Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3187090Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3187255Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.3187295Z Autotune Choices Stats: 2025-12-04T09:58:55.3188057Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.3188278Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3188454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3188748Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3189381Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3190003Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3190625Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3191252Z triton_flex_attention_backward_493 0.0190 ms 81.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3191906Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3192533Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3193171Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3193814Z triton_flex_attention_backward_505 0.0219 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3194440Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3195063Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3195205Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.3195280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3195321Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3195359Z unimplemented [] 2025-12-04T09:58:55.3195418Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3195518Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3196143Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3196184Z graph_break [] 2025-12-04T09:58:55.3196258Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3196299Z Autotune Choices Stats: 2025-12-04T09:58:55.3197055Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.3197195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3197310Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3197470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3198084Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3198687Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3199284Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3199910Z triton_flex_attention_511 0.0120 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3200513Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3201127Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3201731Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3202347Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3202946Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3203550Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3203691Z SingleProcess AUTOTUNE benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.3203735Z Autotune Choices Stats: 2025-12-04T09:58:55.3204509Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.3204728Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3204896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3205172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.3205813Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3206484Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3207107Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3207735Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3208380Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3209026Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3209650Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3210301Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3210941Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3211562Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3211691Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.3211767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3211810Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3211849Z unimplemented [] 2025-12-04T09:58:55.3211910Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3212013Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3212595Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3212634Z graph_break [] 2025-12-04T09:58:55.3212722Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3212763Z Autotune Choices Stats: 2025-12-04T09:58:55.3213505Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 
2025-12-04T09:58:55.3213634Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3213747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3213918Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3214541Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3215149Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3215750Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3216398Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3217038Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3217642Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3218256Z triton_flex_attention_566 0.0140 ms 72.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3218865Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3219479Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3220084Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3220214Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds precompiling for 24 choices 2025-12-04T09:58:55.3220256Z Autotune Choices Stats: 2025-12-04T09:58:55.3221017Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.3221243Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3221420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3221697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3222332Z triton_flex_attention_backward_593 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3222970Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3223604Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3224230Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3224861Z triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3225518Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3226195Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3226835Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3227464Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3228098Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3228228Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 
choices 2025-12-04T09:58:55.3228303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3228343Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3228382Z unimplemented [] 2025-12-04T09:58:55.3228441Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3228541Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3229114Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3229166Z graph_break [] 2025-12-04T09:58:55.3229240Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3229279Z Autotune Choices Stats: 2025-12-04T09:58:55.3230037Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.3230164Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.3230281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3230441Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3231067Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3231681Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3232296Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3232897Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3233504Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3234133Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3234735Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3235348Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3235983Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3236604Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3236731Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.3236773Z Autotune Choices Stats: 2025-12-04T09:58:55.3237525Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.3237741Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3237920Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3238206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3238842Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3239480Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3240102Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3240736Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3241366Z triton_flex_attention_backward_641 0.0201 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3241993Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3242642Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3243269Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3243913Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3244546Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3244675Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.3244752Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.3244795Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3244835Z unimplemented [] 2025-12-04T09:58:55.3244896Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3244997Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3245574Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3245611Z graph_break [] 2025-12-04T09:58:55.3245689Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3245728Z Autotune Choices Stats: 2025-12-04T09:58:55.3246494Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.3246644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3246770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3246934Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3247546Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3248167Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3248778Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3249381Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3249993Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3250589Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3251214Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3251813Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3252427Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3253043Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3253175Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.3253216Z Autotune Choices Stats: 2025-12-04T09:58:55.3253976Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.3254195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3254366Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3254643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3255298Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3255959Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3256606Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3257249Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3257880Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3258509Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3259130Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3259783Z triton_flex_attention_backward_689 0.0220 ms 70.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3260405Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3261038Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3261178Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.3261253Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3261298Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.3261336Z unimplemented [] 2025-12-04T09:58:55.3261396Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3261496Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3262070Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3262109Z graph_break [] 2025-12-04T09:58:55.3262184Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3262224Z Autotune Choices Stats: 2025-12-04T09:58:55.3262969Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.3263095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3263220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3263379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3264004Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3264607Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3265220Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3265833Z triton_flex_attention_695 0.0117 ms 86.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3266479Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3267088Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3267693Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3268329Z triton_flex_attention_710 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3268930Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3269542Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3269686Z SingleProcess AUTOTUNE 
benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.3269726Z Autotune Choices Stats: 2025-12-04T09:58:55.3270491Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.3270711Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3270878Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3271156Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3271784Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3272429Z triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3273058Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3273702Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3274338Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3274971Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3275597Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3276277Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3276928Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3277555Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3277684Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.3277757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3277812Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3277850Z unimplemented [] 
2025-12-04T09:58:55.3277915Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3278014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3278604Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3278643Z graph_break [] 2025-12-04T09:58:55.3278719Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3278760Z Autotune Choices Stats: 2025-12-04T09:58:55.3279505Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.3279633Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3279748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3279911Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3280522Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3281141Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3281748Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3282368Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3282984Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3283592Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3284203Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3284803Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3285431Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3286078Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3286211Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling 
for 24 choices 2025-12-04T09:58:55.3286250Z Autotune Choices Stats: 2025-12-04T09:58:55.3287025Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.3287254Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3287420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3287700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3288339Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3288966Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3289624Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3290251Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3290893Z 
triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3291528Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3292153Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3292780Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3293408Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3294044Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3294173Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.3294247Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3294292Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3294330Z unimplemented [] 2025-12-04T09:58:55.3294392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.3294489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3295077Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3295126Z graph_break [] 2025-12-04T09:58:55.3295201Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3295245Z Autotune Choices Stats: 2025-12-04T09:58:55.3296024Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.3296154Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3296269Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3296433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3297052Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3297662Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3298301Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3298904Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3299521Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3300145Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3300754Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3301358Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3301962Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3302588Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3302719Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.3302758Z Autotune Choices Stats: 
2025-12-04T09:58:55.3303524Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.3303741Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3303915Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3304192Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3304827Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3305453Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3306113Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3306767Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3307398Z triton_flex_attention_backward_825 0.0202 ms 77.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3308040Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3308676Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3309306Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3309930Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3310551Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3310694Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.3310785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3310826Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3310867Z unimplemented [] 2025-12-04T09:58:55.3310928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3311028Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3311606Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3311646Z graph_break [] 2025-12-04T09:58:55.3311721Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3311762Z Autotune Choices Stats: 2025-12-04T09:58:55.3312516Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.3312655Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3312771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3312933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.3313548Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3314154Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3314768Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3315379Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3316028Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3316653Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3317272Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.3317875Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3318482Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3319083Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3319227Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.3319284Z Autotune Choices Stats: 2025-12-04T09:58:55.3320039Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.3320259Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3320436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3320710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3321360Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3321983Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3322606Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3323228Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3323878Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3324498Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3325135Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3325776Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3326435Z triton_flex_attention_backward_864 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3327065Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3327193Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.3327289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3327332Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3327374Z unimplemented [] 2025-12-04T09:58:55.3327434Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3327536Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3328121Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3328160Z graph_break [] 2025-12-04T09:58:55.3328233Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3328275Z Autotune Choices Stats: 2025-12-04T09:58:55.3329030Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.3329167Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3329285Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3329446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3330063Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3330671Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3331276Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3331890Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3332503Z triton_flex_attention_877 0.0130 ms 74.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3333109Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3333725Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3334339Z triton_flex_attention_894 0.0141 ms 68.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3334939Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3335544Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3335684Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.3335725Z Autotune Choices Stats: 2025-12-04T09:58:55.3336539Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.3336755Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3336922Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3337199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3337850Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3338489Z 
triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3339111Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3339739Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3340363Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3341017Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3341643Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3342282Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3342917Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3343540Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3343672Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.3343748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3343790Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3343829Z unimplemented [] 2025-12-04T09:58:55.3343889Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3343989Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3344574Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3344612Z graph_break [] 2025-12-04T09:58:55.3344697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3344737Z Autotune Choices Stats: 2025-12-04T09:58:55.3345481Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.3345609Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3345724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3345894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3346537Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3347156Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3347761Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3348364Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3348994Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3349588Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3350204Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3350808Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3351422Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3352023Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3352152Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.3352194Z Autotune Choices Stats: 2025-12-04T09:58:55.3352955Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.3353183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3353363Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3353638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3354270Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3354906Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3355540Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3356203Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3356832Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3357480Z triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3358120Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3358754Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3359385Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3360023Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3360151Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.3360226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3361935Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3361976Z unimplemented [] 2025-12-04T09:58:55.3362040Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3362143Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3362726Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3362783Z graph_break [] 2025-12-04T09:58:55.3362859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3362899Z Autotune Choices Stats: 2025-12-04T09:58:55.3363656Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.3363784Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3363904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3364067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3364692Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3365301Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3365908Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3366551Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3367154Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3367790Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3368395Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3369011Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3369610Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3370223Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3370353Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.3370394Z Autotune Choices Stats: 2025-12-04T09:58:55.3371157Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.3371373Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3371557Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3371835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3372481Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3373123Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3373750Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3374378Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3375006Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3375639Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3376332Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3376964Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3377606Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3378229Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3378373Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.3378448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3378492Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3378530Z unimplemented [] 2025-12-04T09:58:55.3378594Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3378695Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3379268Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3379305Z graph_break [] 2025-12-04T09:58:55.3379380Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3379419Z Autotune Choices Stats: 2025-12-04T09:58:55.3380163Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.3380304Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3380430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3380592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3381205Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3381818Z triton_flex_attention_1019 0.0113 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3382431Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3383032Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.3383638Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3384246Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3384881Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3385485Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3386150Z triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3386774Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3386905Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.3386945Z Autotune Choices Stats: 2025-12-04T09:58:55.3387701Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.3387919Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3388084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3388360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3389046Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3389669Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3390313Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3390939Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3391584Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3392214Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3392838Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3393500Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3394119Z 
triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3394753Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3394890Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.3394964Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3395006Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3395042Z unimplemented [] 2025-12-04T09:58:55.3395102Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3395200Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3395777Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3395815Z graph_break [] 2025-12-04T09:58:55.3395887Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3395968Z Autotune Choices Stats: 2025-12-04T09:58:55.3396708Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.3396835Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3396972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3397136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3397760Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3398367Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3398988Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3399608Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3400211Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3400811Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3401414Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3402042Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3402647Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3403276Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3403416Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.3403456Z Autotune Choices Stats: 2025-12-04T09:58:55.3404219Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.3404433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3404602Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3404880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3405513Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3406200Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3406819Z triton_flex_attention_backward_1090 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3407469Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3408113Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3408743Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3409373Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3410002Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3410658Z 
triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3411284Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3411413Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.3411487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3411541Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3411578Z unimplemented [] 2025-12-04T09:58:55.3411639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3411748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3412321Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3412365Z graph_break [] 2025-12-04T09:58:55.3412440Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3412481Z Autotune Choices Stats: 2025-12-04T09:58:55.3413220Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.3413352Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3413469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3413628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3414241Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3414867Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3415475Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3416135Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3416756Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3417357Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3417962Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3418567Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3419215Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3419815Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3419944Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.3419985Z Autotune Choices Stats: 2025-12-04T09:58:55.3420763Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.3420995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3421163Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3421437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3422072Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3422696Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3423348Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3423972Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3424611Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3425255Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3425880Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3426541Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3427166Z 
triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3427835Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3427969Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.3428042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3428083Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3428120Z unimplemented [] 2025-12-04T09:58:55.3428181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3428282Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3428867Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3428918Z graph_break [] 2025-12-04T09:58:55.3428990Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3429030Z Autotune Choices Stats: 2025-12-04T09:58:55.3429765Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.3429892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3430005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3430165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3430774Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3431383Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3431990Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3432596Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3433216Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3433831Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3434434Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3435049Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3435652Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3436338Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3436467Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.3436507Z Autotune Choices Stats: 2025-12-04T09:58:55.3437280Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.3437512Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3437677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3437953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3438587Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3439220Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3439849Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3440500Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3441130Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3441776Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3442412Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3443040Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3443667Z 
triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3444298Z triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3444435Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.3444518Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3444561Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3444599Z unimplemented [] 2025-12-04T09:58:55.3444658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3444760Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3445332Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3445370Z graph_break [] 2025-12-04T09:58:55.3445442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3445493Z Autotune Choices Stats: 2025-12-04T09:58:55.3446266Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.3446413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3446527Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3446687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3447295Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3447904Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3448527Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3449142Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3449742Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3450355Z triton_flex_attention_1203 0.0142 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3450980Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3451581Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3452186Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3452803Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3452941Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.3452981Z Autotune Choices Stats: 2025-12-04T09:58:55.3453744Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.3453964Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3454139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3454423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3455062Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3455689Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3456361Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3456987Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3457647Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3458275Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3458909Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3459555Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3460186Z 
triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3460818Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3460955Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.3461030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3461071Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3461110Z unimplemented [] 2025-12-04T09:58:55.3461169Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3461269Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3461858Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3461895Z graph_break [] 2025-12-04T09:58:55.3461968Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3462007Z Autotune Choices Stats: 2025-12-04T09:58:55.3462756Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.3462891Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3463005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3463164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3463773Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3464379Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3464982Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3465606Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3466246Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3466869Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3467476Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3468085Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3468693Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3469294Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3469435Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.3469475Z Autotune Choices Stats: 2025-12-04T09:58:55.3470248Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.3470462Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3470629Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3470904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3471545Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3472178Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3472799Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3473424Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3474066Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3474704Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3475339Z triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3476008Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3476655Z 
triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3477280Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3477409Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.3477483Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3477525Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3477561Z unimplemented [] 2025-12-04T09:58:55.3477621Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3477720Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3478309Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3478358Z graph_break [] 2025-12-04T09:58:55.3478432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3478472Z Autotune Choices Stats: 2025-12-04T09:58:55.3479214Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.3479342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3479472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3479631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3480252Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3480857Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3481461Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3482064Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3482696Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3483302Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3483918Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3484521Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3485140Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3485744Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3485873Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.3485912Z Autotune Choices Stats: 2025-12-04T09:58:55.3486703Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.3486947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3487115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3487393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3488044Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3488674Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3489317Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3489943Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3490571Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3491219Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3491853Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3492491Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3493116Z 
triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3493757Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3493889Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.3493963Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3494005Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3494041Z unimplemented [] 2025-12-04T09:58:55.3494103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3494201Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3494776Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3494824Z graph_break [] 2025-12-04T09:58:55.3494902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3494940Z Autotune Choices Stats: 2025-12-04T09:58:55.3495690Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.3495817Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3495968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3496129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3496760Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3497379Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3497985Z triton_flex_attention_1339 0.0122 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3498591Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3499204Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3499839Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3500442Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3501060Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3501674Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3502278Z triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3502408Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.3502448Z Autotune Choices Stats: 2025-12-04T09:58:55.3503209Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.3503442Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3503609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3503901Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3504537Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3505176Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3505811Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3506474Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3507103Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3507733Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3508390Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3509019Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3509663Z 
triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3510301Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3510429Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.3510503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3510548Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3510584Z unimplemented [] 2025-12-04T09:58:55.3510647Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3510749Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3511327Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3511368Z graph_break [] 2025-12-04T09:58:55.3511444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3511484Z Autotune Choices Stats: 2025-12-04T09:58:55.3512228Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.3512377Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3512492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3512654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3513270Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3513885Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3514508Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3515112Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3515722Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3516366Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3517002Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3517605Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3518223Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3518837Z triton_flex_attention_1392 0.0150 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3518968Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.3519009Z Autotune Choices Stats: 2025-12-04T09:58:55.3519770Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.3519988Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3520153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3520446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3521091Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3521716Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3522354Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3522992Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3523616Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3524244Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3524871Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3525522Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3526189Z 
triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3526831Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3526974Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.3527049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3527090Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3527128Z unimplemented [] 2025-12-04T09:58:55.3527189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3527289Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3527867Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3527905Z graph_break [] 2025-12-04T09:58:55.3527978Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3528020Z Autotune Choices Stats: 2025-12-04T09:58:55.3528768Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.3528910Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3529024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3529185Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3529818Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3530422Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3531048Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3531663Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3532261Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3532863Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3533468Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3534095Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3534698Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3535310Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3535450Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.3535490Z Autotune Choices Stats: 2025-12-04T09:58:55.3536284Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.3536503Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3536669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3536949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3537584Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3538252Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3538876Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3539518Z triton_flex_attention_backward_1458 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3540162Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3540790Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3541422Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3542046Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3542697Z 
triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3543324Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3543452Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.3543542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3543585Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3543633Z unimplemented [] 2025-12-04T09:58:55.3543693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3543794Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3544367Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3544404Z graph_break [] 2025-12-04T09:58:55.3544477Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3544517Z Autotune Choices Stats: 2025-12-04T09:58:55.3545259Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.3545385Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3545501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3545662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3546331Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3546955Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3547560Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3548182Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3548797Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3549397Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3550004Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3550614Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3551248Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3551852Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3551990Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.3552031Z Autotune Choices Stats: 2025-12-04T09:58:55.3552784Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.3553019Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3553187Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3553463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3554099Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3554731Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3555377Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3556040Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3556688Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3557334Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3557957Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3558588Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3559217Z 
triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3559870Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3559998Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.3560074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3560117Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3560155Z unimplemented [] 2025-12-04T09:58:55.3560216Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3560318Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3560902Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3560950Z graph_break [] 2025-12-04T09:58:55.3561025Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3561064Z Autotune Choices Stats: 2025-12-04T09:58:55.3561815Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 2025-12-04T09:58:55.3561943Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3562056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3562215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3562828Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3563465Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3564074Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3564687Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3565291Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3565908Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3566536Z triton_flex_attention_1540 0.0131 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3567141Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3567762Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3568384Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3568514Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.3568555Z Autotune Choices Stats: 2025-12-04T09:58:55.3569330Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.3569558Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3569726Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3570004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3570646Z triton_flex_attention_backward_1559 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3571273Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3571903Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3572553Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3573185Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3573823Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3574462Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3575092Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3575720Z 
triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3576377Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3576526Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 seconds precompiling for 22 choices 2025-12-04T09:58:55.3576600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3576645Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3576683Z unimplemented [] 2025-12-04T09:58:55.3576744Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3576844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3577422Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3577458Z graph_break [] 2025-12-04T09:58:55.3577544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3577584Z Autotune Choices Stats: 2025-12-04T09:58:55.3578326Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.3578472Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3578586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3578746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3579367Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3579973Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3580599Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3581205Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3581823Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3582423Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3583041Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3583647Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3584252Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3584873Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3585004Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.3585045Z Autotune Choices Stats: 2025-12-04T09:58:55.3585804Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.3586078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3586242Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3586530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3587165Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3587791Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3588416Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3589070Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3589696Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3590339Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3590964Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3591604Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3592230Z 
triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3592860Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3593002Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.3593077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3593119Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3593156Z unimplemented [] 2025-12-04T09:58:55.3593217Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3593328Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3593903Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3593940Z graph_break [] 2025-12-04T09:58:55.3594014Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3594053Z Autotune Choices Stats: 2025-12-04T09:58:55.3594818Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.3594957Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3595069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3595232Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3595850Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3596494Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3597096Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3597728Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3598331Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3598946Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3599573Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3600180Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3600782Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3601387Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3601529Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.3601569Z Autotune Choices Stats: 2025-12-04T09:58:55.3602344Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.3602562Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3602726Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3603016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3603652Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3604289Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3604913Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3605539Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3606242Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3606869Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3607513Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3608155Z triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3608782Z 
triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3609407Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3609535Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.3609610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3609654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3609692Z unimplemented [] 2025-12-04T09:58:55.3609766Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3609864Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3610457Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3610496Z graph_break [] 2025-12-04T09:58:55.3610569Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3610610Z Autotune Choices Stats: 2025-12-04T09:58:55.3611348Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.3611476Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3611599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3611769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3612379Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3612981Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3613586Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3614193Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3614822Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3615424Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3616079Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3616693Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3617303Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3617908Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3618037Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.3618076Z Autotune Choices Stats: 2025-12-04T09:58:55.3618836Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.3619078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3619243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3619522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3620160Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3620793Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3621422Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3622050Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3622683Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3623336Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3623965Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3624603Z triton_flex_attention_backward_1701 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3625240Z 
triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3625871Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3626031Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.3626105Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3626147Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3626184Z unimplemented [] 2025-12-04T09:58:55.3626244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3626346Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3626922Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3626981Z graph_break [] 2025-12-04T09:58:55.3627053Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3627094Z Autotune Choices Stats: 2025-12-04T09:58:55.3627848Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.3627977Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3628092Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3628253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3628881Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3629495Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3630091Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3630704Z triton_flex_attention_1705 0.0130 ms 80.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3631315Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3631938Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3632547Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3633161Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3633772Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3634374Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3634504Z SingleProcess AUTOTUNE benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.3634544Z Autotune Choices Stats: 2025-12-04T09:58:55.3635300Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.3635529Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3635691Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3636024Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3636658Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3637307Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3637943Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3638569Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3639200Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3639821Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3640473Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3641103Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3641745Z 
triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3642387Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3642514Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.3642589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3642633Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3642673Z unimplemented [] 2025-12-04T09:58:55.3642733Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3642831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3643405Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3643442Z graph_break [] 2025-12-04T09:58:55.3643516Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3643564Z Autotune Choices Stats: 2025-12-04T09:58:55.3644320Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.3644457Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3644572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3644734Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3645341Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3645998Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3646614Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3647218Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3647820Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3648424Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3649056Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3649664Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3650286Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3650905Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3651034Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds precompiling for 24 choices 2025-12-04T09:58:55.3651074Z Autotune Choices Stats: 2025-12-04T09:58:55.3651830Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.3652049Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3652216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3652503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3653145Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3653770Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3654404Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3655049Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3655677Z triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3656336Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3656961Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3657621Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3658249Z 
triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3658899Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3659041Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.3659119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3659160Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3659198Z unimplemented [] 2025-12-04T09:58:55.3659259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3659361Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3659934Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3659972Z graph_break [] 2025-12-04T09:58:55.3660047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3660086Z Autotune Choices Stats: 2025-12-04T09:58:55.3660827Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.3660964Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3661079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3661240Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3661860Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3662460Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3663079Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3663697Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3664301Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3664907Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3665525Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3666190Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3666794Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3667411Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3667551Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.3667592Z Autotune Choices Stats: 2025-12-04T09:58:55.3668361Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.3668577Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3668742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3669016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3669646Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3670302Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3670925Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3671555Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3672198Z triton_flex_attention_backward_1837 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3672832Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3674405Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3675056Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3675709Z 
triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3676398Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3676544Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.3676620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3676664Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3676702Z unimplemented [] 2025-12-04T09:58:55.3676765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3676864Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3677444Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3677481Z graph_break [] 2025-12-04T09:58:55.3677555Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3677596Z Autotune Choices Stats: 2025-12-04T09:58:55.3678342Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.3678516Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3678632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3678808Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3679439Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3680047Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3680669Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3681272Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3681876Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3682483Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3683104Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3683715Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3684329Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3684935Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3685074Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.3685114Z Autotune Choices Stats: 2025-12-04T09:58:55.3685872Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.3686130Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3686295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3686572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3687234Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3687873Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3688520Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3689154Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3689804Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3690433Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3691055Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3691715Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3692368Z 
triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3692992Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3693124Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.3693199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3693242Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3693282Z unimplemented [] 2025-12-04T09:58:55.3693344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3693454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3694032Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3694071Z graph_break [] 2025-12-04T09:58:55.3694143Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3694186Z Autotune Choices Stats: 2025-12-04T09:58:55.3694930Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.3695058Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3695174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3695346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3695995Z triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3696629Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3697230Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3697847Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3698458Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3699060Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3699662Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3700282Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3700906Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3701510Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3701639Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.3701680Z Autotune Choices Stats: 2025-12-04T09:58:55.3702446Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.3702666Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3702832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3703111Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3703743Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3704383Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3705029Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3705659Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3706348Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3706978Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3707605Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3708235Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3708876Z 
triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3709535Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3709663Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.3709742Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3709784Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3709821Z unimplemented [] 2025-12-04T09:58:55.3709881Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3709980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3710563Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3710601Z graph_break [] 2025-12-04T09:58:55.3710674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3710717Z Autotune Choices Stats: 2025-12-04T09:58:55.3711463Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.3711592Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3711707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3711869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3712475Z triton_flex_attention_1938 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3713088Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3713714Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3714321Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3714936Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3715543Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3716277Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3716879Z triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3717495Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3718125Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3718254Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.3718294Z Autotune Choices Stats: 2025-12-04T09:58:55.3719052Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.3719285Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3719449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3719726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3720363Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3720992Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3721630Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3722283Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3722916Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3723556Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3724180Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3724809Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3725433Z 
triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3726103Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3726247Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.3726340Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.3726387Z Traceback (most recent call last): 2025-12-04T09:58:55.3726554Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.3726594Z self.assertTrue( 2025-12-04T09:58:55.3726701Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.3726750Z raise self.failureException(msg) 2025-12-04T09:58:55.3726877Z AssertionError: False is not true : Log file /tmp/tmpyi1436_p/flex_attention_configs.json was not created 2025-12-04T09:58:55.3726881Z 
2025-12-04T09:58:55.3726956Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.3727122Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.3727125Z 2025-12-04T09:58:55.3727217Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.3727292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3727336Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3727372Z unimplemented [] 2025-12-04T09:58:55.3727446Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3728025Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.3728125Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3728163Z graph_break [] 2025-12-04T09:58:55.3728239Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3728727Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.3728777Z current_size = base.storage().size() 2025-12-04T09:58:55.3728817Z Autotune Choices Stats: 2025-12-04T09:58:55.3729566Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.3729707Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3729820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3729993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3730607Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3731206Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3731823Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3732424Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3733028Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3733627Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3734243Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3734861Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3735460Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3736114Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3736245Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.3736285Z Autotune Choices Stats: 2025-12-04T09:58:55.3737043Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.3737262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3737425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3737703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3738350Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3738990Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3739626Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3740264Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3740896Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3741528Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3742151Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3742793Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3743441Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3744067Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3744195Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.3744274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3744318Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3744355Z unimplemented [] 2025-12-04T09:58:55.3744418Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3744530Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3745102Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3745141Z graph_break [] 2025-12-04T09:58:55.3745215Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.3745257Z Autotune Choices Stats: 2025-12-04T09:58:55.3746029Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.3746155Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3746269Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3746451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3747061Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3747695Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3748296Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3748914Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3749519Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3750122Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3750723Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3751337Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3751960Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3752562Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3752693Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.3752734Z Autotune Choices Stats: 2025-12-04T09:58:55.3753497Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.3753713Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3753885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3754163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3754793Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3755434Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3756119Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3756748Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3757392Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3758010Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3758642Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3759265Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3759903Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3760550Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3760679Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.3760757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3760799Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3760837Z unimplemented [] 2025-12-04T09:58:55.3760897Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3761001Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3761584Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3761623Z graph_break [] 2025-12-04T09:58:55.3761697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3761737Z Autotune Choices Stats: 2025-12-04T09:58:55.3762479Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.3762606Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3762723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3762886Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3763496Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3764107Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3764728Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3765332Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3766005Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.3766608Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3767213Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3767816Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3768429Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3769057Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3769186Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.3769227Z Autotune Choices Stats: 2025-12-04T09:58:55.3769979Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.3770211Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.3770377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3770654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3771290Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3771920Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3772555Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3773198Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3773832Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3774469Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3775092Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3775728Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3776410Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3777048Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3777191Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.3777265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3777310Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3777346Z unimplemented [] 2025-12-04T09:58:55.3777422Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3777523Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3778104Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3778141Z graph_break [] 2025-12-04T09:58:55.3778217Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3778257Z Autotune Choices Stats: 2025-12-04T09:58:55.3779003Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.3779133Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3779245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3779409Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3780023Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3780629Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3781241Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3781869Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3782475Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3783096Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3783698Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3784300Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3784906Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3785513Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3785652Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.3785692Z Autotune Choices Stats: 2025-12-04T09:58:55.3786503Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.3786718Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3786885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3787176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3787805Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3788432Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3789061Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3789699Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3790349Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3790975Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3791614Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3792244Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3792868Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3793494Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3793636Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.3793709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3793762Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3793799Z unimplemented [] 2025-12-04T09:58:55.3793862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3793960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3794553Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3794591Z graph_break [] 2025-12-04T09:58:55.3794668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3794709Z Autotune Choices Stats: 2025-12-04T09:58:55.3795453Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.3795593Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3795707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3795878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3796524Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3797130Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.3797734Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3798356Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3798983Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3799585Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3800205Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3800816Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3801416Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3802017Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3802157Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.3802196Z Autotune Choices Stats: 2025-12-04T09:58:55.3802967Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.3803194Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3803357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3803634Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3804280Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3804911Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3805537Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3806192Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3806839Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3807492Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3808120Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3808758Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3809386Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3810012Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3810142Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.3810215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3810269Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3810306Z unimplemented [] 2025-12-04T09:58:55.3810369Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3810470Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3811045Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3811098Z graph_break [] 2025-12-04T09:58:55.3811172Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3811215Z Autotune Choices Stats: 2025-12-04T09:58:55.3811974Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.3812104Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3812218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3812379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3813000Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3813610Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3814214Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3814820Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3815431Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3816091Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3816696Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3817311Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3817915Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3818519Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3818649Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.3818708Z Autotune Choices Stats: 2025-12-04T09:58:55.3819469Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.3819698Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3819871Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3820153Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3820790Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3821435Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3822060Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3822690Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3823317Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3823951Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3824598Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3825230Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3825866Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3826527Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3826656Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.3826729Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3826772Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3826810Z unimplemented [] 2025-12-04T09:58:55.3826874Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3826973Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3827549Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3827601Z graph_break [] 2025-12-04T09:58:55.3827674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3827715Z Autotune Choices Stats: 2025-12-04T09:58:55.3828482Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.3828611Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3828724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3828883Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3829501Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3830117Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3830720Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3831323Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3831939Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3832541Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3833165Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3833776Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3834396Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3834998Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3835128Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.3835168Z Autotune Choices Stats: 2025-12-04T09:58:55.3835958Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.3836192Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3836355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3836646Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3837294Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3837921Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3838558Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3839184Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3839814Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3840443Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3841077Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3841727Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3842359Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3842995Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3843123Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.3843199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3843241Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3843279Z unimplemented [] 2025-12-04T09:58:55.3843339Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3843440Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3844013Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3844052Z graph_break [] 2025-12-04T09:58:55.3844124Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3844177Z Autotune Choices Stats: 2025-12-04T09:58:55.3844917Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.3845053Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3845170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3845339Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3845989Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3846605Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3847213Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3847821Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3848426Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3849042Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3849655Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3850276Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3850877Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3851496Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3851625Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.3851668Z Autotune Choices Stats: 2025-12-04T09:58:55.3852428Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.3852645Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3852810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3853095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.3853730Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3854375Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3854996Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3855640Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3856321Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3856945Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3857582Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3858218Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3858859Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3859482Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3859623Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.3859698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3859741Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3859782Z unimplemented [] 2025-12-04T09:58:55.3859844Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3859946Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3860524Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3860562Z graph_break [] 2025-12-04T09:58:55.3860635Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3860677Z Autotune Choices Stats: 2025-12-04T09:58:55.3861420Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.3861557Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3861673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3861843Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3862467Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3863069Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3863684Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3864284Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3864895Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3865496Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3866136Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3866754Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3867373Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3870016Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3870180Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.3870225Z Autotune Choices Stats: 2025-12-04T09:58:55.3870986Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.3871216Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3871386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3871661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3872310Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3872942Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3873582Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3874205Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3874844Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3875469Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3876145Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3876781Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3877422Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3878070Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3878201Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.3878279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3878322Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3878360Z unimplemented [] 2025-12-04T09:58:55.3878422Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3878538Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3879107Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3879146Z graph_break [] 2025-12-04T09:58:55.3879221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3879262Z Autotune Choices Stats: 2025-12-04T09:58:55.3879997Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.3880125Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.3880243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3880414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3881025Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3881645Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3882248Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3882864Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3883463Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3884069Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3884675Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3885296Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3885918Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3886560Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3886689Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.3886730Z Autotune Choices Stats: 2025-12-04T09:58:55.3887514Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.3887731Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3887896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3888174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3888806Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3889447Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3890078Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3890713Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3891355Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3891979Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3892602Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.3893228Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3893864Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3894510Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3894639Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.3894713Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.3894757Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3894795Z unimplemented [] 2025-12-04T09:58:55.3894858Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3894957Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3895544Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3895581Z graph_break [] 2025-12-04T09:58:55.3895657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3895697Z Autotune Choices Stats: 2025-12-04T09:58:55.3896480Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.3896608Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3896720Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3896880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3897484Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3898107Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3898734Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.3899333Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3899951Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3900554Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3901159Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3901760Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3902379Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3903001Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3903132Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.3903172Z Autotune Choices Stats: 2025-12-04T09:58:55.3903924Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.3904152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3904316Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3904597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3905227Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3905852Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3906526Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3907177Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3907803Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3908441Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3909059Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3909687Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3910310Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3910946Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3911084Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.3911159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3911204Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.3911242Z unimplemented [] 2025-12-04T09:58:55.3911304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3911416Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3911992Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3912031Z graph_break [] 2025-12-04T09:58:55.3912107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3912148Z Autotune Choices Stats: 2025-12-04T09:58:55.3912899Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.3913026Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3913138Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3913301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3913914Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3914519Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3915130Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3915763Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3916396Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3917017Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3917614Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3918222Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3918823Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3919438Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3919579Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.3919618Z Autotune Choices Stats: 2025-12-04T09:58:55.3920399Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.3920616Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3920779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3921068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3921691Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3922321Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3922944Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3923571Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3924215Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3924848Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3925484Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3926147Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3926775Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3927399Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3927545Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.3927618Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3927662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3927711Z unimplemented [] 
2025-12-04T09:58:55.3927772Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3927870Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3928458Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.3928497Z graph_break [] 2025-12-04T09:58:55.3928570Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3928611Z Autotune Choices Stats: 2025-12-04T09:58:55.3929354Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.3929495Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3929608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3929767Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3930376Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3931095Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3931701Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3932313Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3932940Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3933547Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3934160Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3934761Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3935361Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3936008Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3936152Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.3936192Z Autotune Choices Stats: 2025-12-04T09:58:55.3936949Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.3937190Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3937355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3937630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3938280Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3938904Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3939531Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3940158Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3940796Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3941441Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3942067Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3942714Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3943340Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3943962Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3944091Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.3944166Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3944221Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3944258Z unimplemented [] 2025-12-04T09:58:55.3944319Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.3944418Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3945004Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3945056Z graph_break [] 2025-12-04T09:58:55.3945129Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3945170Z Autotune Choices Stats: 2025-12-04T09:58:55.3945918Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.3946081Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3946195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3946353Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3946977Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3947580Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3948184Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3948785Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3949412Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3950041Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3950641Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3951253Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3951860Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3952461Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3952590Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.3952630Z Autotune Choices Stats: 
2025-12-04T09:58:55.3953404Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.3953633Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3953797Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3954085Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3954714Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3955346Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3956020Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3956642Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3957273Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3957908Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3958561Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3959188Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3959827Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3960454Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3960583Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.3960658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3960699Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3960738Z unimplemented [] 2025-12-04T09:58:55.3960798Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3960899Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3961472Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3961521Z graph_break [] 2025-12-04T09:58:55.3961594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3961634Z Autotune Choices Stats: 2025-12-04T09:58:55.3962369Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.3962520Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3962634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3962794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.3963403Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3964018Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3964624Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3965226Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3965826Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3966456Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3967089Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3967689Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3968303Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3968907Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3969036Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.3969077Z Autotune Choices Stats: 2025-12-04T09:58:55.3969837Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.3970068Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3970232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3970518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3971159Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.3971783Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3972419Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3973042Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3973676Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3974303Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3974934Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3975584Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3976252Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3976885Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3977013Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.3977087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3977130Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3977167Z unimplemented [] 2025-12-04T09:58:55.3977227Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3977328Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3977911Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3977948Z graph_break [] 2025-12-04T09:58:55.3978021Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3978074Z Autotune Choices Stats: 2025-12-04T09:58:55.3978813Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.3978953Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3979067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3979229Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3979862Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3980470Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3981085Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3981686Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3982289Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3982906Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3983508Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3984132Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3984737Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3985350Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3985476Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.3985517Z Autotune Choices Stats: 2025-12-04T09:58:55.3986310Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.3986528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3986695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3986990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3987621Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3988274Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3988898Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3989549Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3990176Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3990801Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3991429Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3992066Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3992714Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3993341Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3993480Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.3993556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.3993598Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.3993636Z unimplemented [] 2025-12-04T09:58:55.3993696Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.3993796Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.3994371Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.3994409Z graph_break [] 2025-12-04T09:58:55.3994483Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.3994522Z Autotune Choices Stats: 2025-12-04T09:58:55.3995254Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.3995391Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.3995508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.3995669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.3996343Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3996943Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3997558Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.3998164Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3998767Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3999370Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.3999984Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4000599Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4001207Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4001801Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4001938Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.4001979Z Autotune Choices Stats: 2025-12-04T09:58:55.4002744Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.4002964Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4003129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4003404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4004037Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4004685Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4005333Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4006010Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4006651Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4007280Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4007904Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4008538Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4009179Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4009815Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4009944Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.4010017Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4010060Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4010097Z unimplemented [] 2025-12-04T09:58:55.4010158Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4010259Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4010844Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4010882Z graph_break [] 2025-12-04T09:58:55.4010955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4010994Z Autotune Choices Stats: 2025-12-04T09:58:55.4011747Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.4011878Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4011990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4012165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4012777Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4013398Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4014004Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4014617Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4015217Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4015825Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4016460Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4017079Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4017692Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4018312Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4018441Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.4018480Z Autotune Choices Stats: 2025-12-04T09:58:55.4019258Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.4019475Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4019638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4019921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4020558Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4021193Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4021829Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4022464Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4023092Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4023733Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4024357Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4024986Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.4025627Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4026329Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4026458Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.4026531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4026574Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4026610Z unimplemented [] 2025-12-04T09:58:55.4026673Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4026772Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4027349Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4027418Z graph_break [] 2025-12-04T09:58:55.4027492Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4027533Z Autotune Choices Stats: 2025-12-04T09:58:55.4028279Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.4028408Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4028523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4028684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4029305Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4029929Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4030576Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4031177Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.4031795Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4032397Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4032997Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4033601Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4034217Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4034839Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4034971Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.4035011Z Autotune Choices Stats: 2025-12-04T09:58:55.4035770Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.4036046Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4036209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4036483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4037118Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4037747Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4038386Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4039032Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4039652Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4040290Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4040912Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4041541Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4042165Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4042805Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4042943Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.4043016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4043060Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4043097Z unimplemented [] 2025-12-04T09:58:55.4043159Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4043275Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4043851Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4043889Z graph_break [] 2025-12-04T09:58:55.4043962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4044002Z Autotune Choices Stats: 2025-12-04T09:58:55.4044757Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.4044885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4044998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4045161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4045774Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4046418Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4047045Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4047670Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4048271Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4048889Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4049496Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4050098Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4050698Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4051310Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4051449Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.4051489Z Autotune Choices Stats: 2025-12-04T09:58:55.4052258Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.4052476Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4052638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4052923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4053557Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4054186Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4054809Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4055442Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4056150Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4056774Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4057410Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4058039Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4058666Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4059287Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4059435Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.4059509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4059550Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4059587Z unimplemented [] 2025-12-04T09:58:55.4059677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4059775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4060369Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4060408Z graph_break [] 2025-12-04T09:58:55.4060482Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4060523Z Autotune Choices Stats: 2025-12-04T09:58:55.4061260Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.4061391Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4061514Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4061674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4062283Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4062886Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4063492Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4064107Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4064736Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4065334Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4066005Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4066606Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4067212Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4067814Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4067955Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.4067996Z Autotune Choices Stats: 2025-12-04T09:58:55.4068752Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.4068997Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4069162Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4069442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4070085Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4070700Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4071327Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4071953Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4072591Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4073237Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4073868Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4074496Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4075117Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4075737Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4075869Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.4075982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4076025Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4076084Z unimplemented [] 2025-12-04T09:58:55.4076145Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4076247Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4076822Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4076877Z graph_break [] 2025-12-04T09:58:55.4076950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4076990Z Autotune Choices Stats: 2025-12-04T09:58:55.4077744Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.4077872Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4077986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4078146Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4078781Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4079383Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4079985Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4080592Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4081201Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4081825Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4082431Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4083052Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4083651Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4084253Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4084380Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.4084422Z Autotune Choices Stats: 2025-12-04T09:58:55.4085190Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.4085425Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4085590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4085877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4086536Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4087186Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4087811Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4088433Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4089067Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4089712Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4090361Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4090984Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4091643Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4092271Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4092398Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.4092472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4092515Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4092553Z unimplemented [] 2025-12-04T09:58:55.4092614Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4092714Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4093279Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4093326Z graph_break [] 
2025-12-04T09:58:55.4093403Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4093442Z Autotune Choices Stats: 2025-12-04T09:58:55.4094187Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.4094334Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4094450Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4094610Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4095224Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4095839Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4096480Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4097079Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4097680Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4098299Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4098929Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4099533Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4100149Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4100751Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4100878Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.4100922Z Autotune Choices Stats: 2025-12-04T09:58:55.4101673Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.4101903Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4102068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4102359Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4103005Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4103627Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4104261Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4104891Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4105523Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4106175Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4106814Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4107474Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4108099Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4108742Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4108870Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.4108944Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4108988Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4109027Z unimplemented [] 2025-12-04T09:58:55.4109086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4109187Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4109758Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4109796Z graph_break [] 2025-12-04T09:58:55.4109871Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.4109911Z Autotune Choices Stats: 2025-12-04T09:58:55.4110661Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.4110798Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4110913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4111076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4111702Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4112308Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4112921Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4113526Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4114131Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4114749Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4115356Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4116018Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4116620Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4117238Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4117367Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.4117407Z Autotune Choices Stats: 2025-12-04T09:58:55.4118168Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.4118386Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4118549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4118837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4119471Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4120121Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4120747Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4121381Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4122011Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4122643Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4123268Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4123906Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4124560Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4125191Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4125332Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.4125407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4125450Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4125488Z unimplemented [] 2025-12-04T09:58:55.4125550Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4125650Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4126247Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4126286Z graph_break [] 2025-12-04T09:58:55.4126363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4126403Z Autotune Choices Stats: 
2025-12-04T09:58:55.4127151Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.4127295Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4127408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4127569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4128203Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4128810Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4129436Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4130040Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4130650Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4131254Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4131871Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4132481Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4133093Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4133700Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4133840Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.4133880Z Autotune Choices Stats: 2025-12-04T09:58:55.4134634Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.4134854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4135016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4135294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4135972Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4136610Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4137260Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4137885Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4138532Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4139164Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4139790Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4140445Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4141080Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4141716Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4141846Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.4141919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4141963Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4142000Z unimplemented [] 2025-12-04T09:58:55.4142060Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4142161Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4142751Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4142790Z graph_break [] 2025-12-04T09:58:55.4142864Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4142904Z Autotune Choices Stats: 2025-12-04T09:58:55.4143643Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.4143771Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4143884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4144056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4144666Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4145287Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4145892Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4146537Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4147142Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.4147741Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4148344Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4148961Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4149594Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4150199Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4150329Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.4150368Z Autotune Choices Stats: 2025-12-04T09:58:55.4151136Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.4151354Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.4151518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4151800Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4152434Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4153064Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4153710Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4154346Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4154991Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4155616Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4156282Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4156912Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4157556Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4158207Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4158337Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.4158410Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4158454Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4158494Z unimplemented [] 2025-12-04T09:58:55.4158557Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4158656Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4159243Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4159282Z graph_break [] 2025-12-04T09:58:55.4159356Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4159396Z Autotune Choices Stats: 2025-12-04T09:58:55.4160129Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.4160258Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4160372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4160534Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4161146Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4161756Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4162386Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4162991Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4163602Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4164204Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4164807Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4165412Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4166060Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4166692Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4166821Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.4166861Z Autotune Choices Stats: 2025-12-04T09:58:55.4167617Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.4167849Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4168013Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4168292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4168929Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4169558Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4170186Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4170834Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4171467Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4172104Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4172728Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4173360Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4173993Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4174626Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4174765Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.4174842Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4174884Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4174921Z unimplemented [] 2025-12-04T09:58:55.4174992Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4175094Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4175670Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4175708Z graph_break [] 2025-12-04T09:58:55.4175782Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4175824Z Autotune Choices Stats: 2025-12-04T09:58:55.4176629Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.4176758Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4176872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4177035Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4177643Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4178250Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4178864Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4179487Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4180088Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4180699Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4181306Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4181913Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4182521Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4183134Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4183273Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.4183314Z Autotune Choices Stats: 2025-12-04T09:58:55.4184093Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4184312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4184478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4184773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4185406Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4186060Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4186678Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4187320Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4187979Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4188606Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4189247Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4189879Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4190508Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4191134Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4191273Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.4191349Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4191402Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4191440Z unimplemented [] 2025-12-04T09:58:55.4191499Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4191599Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4192184Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4192226Z graph_break [] 2025-12-04T09:58:55.4192299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4192339Z Autotune Choices Stats: 2025-12-04T09:58:55.4193085Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.4193224Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4193338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4193501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4194113Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4194714Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4195318Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4195959Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4196593Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4197201Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4197819Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4198430Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4199035Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4199641Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4199780Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.4199821Z Autotune Choices Stats: 2025-12-04T09:58:55.4200599Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4200827Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4200992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4201271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4201916Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4202547Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4203174Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4203801Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4204445Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4205097Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4205728Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4206385Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4207021Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4207645Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4207773Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.4207865Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4207908Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4207946Z unimplemented [] 2025-12-04T09:58:55.4208006Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4208106Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4208686Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4208736Z graph_break [] 2025-12-04T09:58:55.4208811Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4208851Z Autotune Choices Stats: 2025-12-04T09:58:55.4209602Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.4209728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4209844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4210006Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4210628Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4211239Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4211844Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4212447Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4213066Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4213697Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4214302Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4214915Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4215516Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4216175Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4216302Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.4216358Z Autotune Choices Stats: 2025-12-04T09:58:55.4217120Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.4217353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4217534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4217811Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4218453Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4219099Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4219725Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4220350Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4220984Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4221619Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4222263Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4222893Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4223530Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4224160Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4224289Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.4224362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4224409Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4224446Z unimplemented [] 2025-12-04T09:58:55.4224508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4224608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4225187Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4225240Z graph_break [] 2025-12-04T09:58:55.4225315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4225372Z Autotune Choices Stats: 2025-12-04T09:58:55.4226173Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.4226302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4226415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4226575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4227201Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4227809Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4228412Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4229013Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4229632Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4230246Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4230864Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4231470Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4232086Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4232688Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4232817Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.4232856Z Autotune Choices Stats: 2025-12-04T09:58:55.4233619Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4233848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4234014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4234302Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4234959Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4235580Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4236261Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4236885Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4237513Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4238144Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4238775Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4239429Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4240056Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4240699Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4240828Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.4240902Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4240946Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4240983Z unimplemented [] 2025-12-04T09:58:55.4241047Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4241147Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4241724Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4241761Z graph_break [] 2025-12-04T09:58:55.4241848Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4241887Z Autotune Choices Stats: 2025-12-04T09:58:55.4242633Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.4242770Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4242883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4243058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4243670Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4244284Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4244895Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4245499Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4246144Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4246764Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4247389Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4247987Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4248596Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4249203Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4249334Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.4249375Z Autotune Choices Stats: 2025-12-04T09:58:55.4250137Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.4250353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4250528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4250806Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4251457Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4252094Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4252722Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4253362Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4254001Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4254634Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4255271Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4255955Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4256583Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4257223Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4257352Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.4257427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4257472Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4257510Z unimplemented [] 2025-12-04T09:58:55.4257572Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4257670Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4258253Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4258291Z graph_break [] 2025-12-04T09:58:55.4258363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4258407Z Autotune Choices Stats: 2025-12-04T09:58:55.4259147Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.4259288Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4259400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4259573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4260197Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4260797Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4261409Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4262014Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4262628Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4263228Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4263846Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4264471Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4265084Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4265695Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4265826Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.4265868Z Autotune Choices Stats: 2025-12-04T09:58:55.4266660Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.4266879Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4267047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4267327Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.4267982Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4268630Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4269264Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4269911Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4270534Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4271159Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4271786Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4272427Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4273074Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4273703Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4273834Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.4273911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4273953Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4273991Z unimplemented [] 2025-12-04T09:58:55.4274052Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4274165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4274738Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4274778Z graph_break [] 2025-12-04T09:58:55.4274852Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4274895Z Autotune Choices Stats: 2025-12-04T09:58:55.4275639Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.4275766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4275892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4276084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4276696Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4277330Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4277933Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4278556Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4279159Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4279760Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4280367Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4280986Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4281612Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4282215Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4282345Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.4282386Z Autotune Choices Stats: 2025-12-04T09:58:55.4283171Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4283388Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4283554Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4283832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4284464Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4285104Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4285748Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4286400Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4287047Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4287677Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4288304Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4288926Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4289574Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4290222Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4290351Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.4290427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4290470Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4290507Z unimplemented [] 2025-12-04T09:58:55.4290567Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4290667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4291251Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4291291Z graph_break [] 2025-12-04T09:58:55.4291365Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4291409Z Autotune Choices Stats: 2025-12-04T09:58:55.4292156Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.4292283Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4292399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4292560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4293173Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4293794Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4294417Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4295020Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4295639Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4296285Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4296889Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4297491Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4298109Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4298748Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4298877Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.4298920Z Autotune Choices Stats: 2025-12-04T09:58:55.4299695Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.4299913Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4300080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4300361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4300995Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4301621Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4302256Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4302910Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4303543Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4304180Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4304805Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4305442Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4306111Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4306752Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4306893Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.4306970Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4307027Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4307066Z unimplemented [] 2025-12-04T09:58:55.4307127Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4307228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4307801Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4307843Z graph_break [] 2025-12-04T09:58:55.4307919Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4307958Z Autotune Choices Stats: 2025-12-04T09:58:55.4308708Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.4308834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4308949Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4309110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4309719Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4310319Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4310925Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4311552Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4312162Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4312773Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4313380Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4313986Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4314593Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4315207Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4315350Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.4315393Z Autotune Choices Stats: 2025-12-04T09:58:55.4316202Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.4316417Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4316601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4316877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4317514Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4318146Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4318775Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4319412Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4320060Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4320697Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4321332Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4321960Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4322586Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4323213Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4323351Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.4323434Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.4323477Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4323514Z unimplemented [] 2025-12-04T09:58:55.4323577Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4323677Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4324265Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4324304Z graph_break [] 2025-12-04T09:58:55.4324381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4324420Z Autotune Choices Stats: 2025-12-04T09:58:55.4325164Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.4325293Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4325406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4325567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4326214Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4326819Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4327433Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.4328050Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4328668Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4329272Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4329886Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4330489Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4331091Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4331697Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4331844Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.4331894Z Autotune Choices Stats: 2025-12-04T09:58:55.4332659Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.4332875Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4333040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4333316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4333966Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4334592Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4335215Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4335841Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4336522Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4337177Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4337795Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4338444Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4339073Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4339699Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4339829Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.4339917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4339961Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.4339999Z unimplemented [] 2025-12-04T09:58:55.4340060Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4340159Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4340737Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4340786Z graph_break [] 2025-12-04T09:58:55.4340861Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4340911Z Autotune Choices Stats: 2025-12-04T09:58:55.4341647Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.4341774Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4341890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4342065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4342680Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4343292Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4343897Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4344515Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4345135Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4345751Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4346388Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4347010Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4347617Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4348223Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4348366Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.4348406Z Autotune Choices Stats: 2025-12-04T09:58:55.4349164Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.4349395Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4349575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4349852Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4350486Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4351117Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4351748Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4352372Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4353011Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4353648Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4354289Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4354920Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4355557Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4356225Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4356356Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.4356430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4356474Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4356510Z unimplemented [] 
2025-12-04T09:58:55.4356571Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4356669Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4357259Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4357297Z graph_break [] 2025-12-04T09:58:55.4357383Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4357423Z Autotune Choices Stats: 2025-12-04T09:58:55.4358178Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.4358306Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4358422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4358583Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4359217Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4359819Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4360424Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4361028Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4361641Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4362275Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4362883Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4363498Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4364100Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4364704Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4364836Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.4364876Z Autotune Choices Stats: 2025-12-04T09:58:55.4365638Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.4365863Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4366080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4366356Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4369891Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4370516Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4371163Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4371789Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4372421Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4373062Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4373706Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4374334Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4374972Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4375596Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4375725Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.4375801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4375842Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4375879Z unimplemented [] 2025-12-04T09:58:55.4375983Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.4376082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4376655Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4376706Z graph_break [] 2025-12-04T09:58:55.4376780Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4376821Z Autotune Choices Stats: 2025-12-04T09:58:55.4377561Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.4377701Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4377828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4377987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4378598Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4379211Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4379816Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4381885Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4382496Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4383121Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4383754Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4384353Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4384970Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4385576Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4385707Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.4385748Z Autotune Choices Stats: 
2025-12-04T09:58:55.4386556Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.4386775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4386959Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4387241Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4387896Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4388521Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4389157Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4389787Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4390418Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4391050Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4391690Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4392339Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4392967Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4393606Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4393736Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.4393814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4393856Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4393894Z unimplemented [] 2025-12-04T09:58:55.4393954Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4394057Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4394636Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4394674Z graph_break [] 2025-12-04T09:58:55.4394748Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4394790Z Autotune Choices Stats: 2025-12-04T09:58:55.4395535Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.4395675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4395799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4395993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.4396627Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4397231Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4397847Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4398452Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4399057Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4399662Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4400281Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4400910Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4401513Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4402127Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4402254Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.4402296Z Autotune Choices Stats: 2025-12-04T09:58:55.4403063Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.4403282Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4403448Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4403725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4404365Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4405014Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4405637Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4406305Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4406937Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4407567Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4408185Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4408825Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4409485Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4410112Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4410242Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.4410318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4410360Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4410411Z unimplemented [] 2025-12-04T09:58:55.4410473Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4410575Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4411152Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4411190Z graph_break [] 2025-12-04T09:58:55.4411265Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4411304Z Autotune Choices Stats: 2025-12-04T09:58:55.4412050Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.4412185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4412302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4412464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4413073Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4413695Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4414294Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4414909Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4415516Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4416157Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4416763Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4417375Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4417998Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4418599Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4418728Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.4418771Z Autotune Choices Stats: 2025-12-04T09:58:55.4419535Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.4419754Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4419921Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4420194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4420835Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4421469Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4422116Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4422742Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4423381Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4424007Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4424627Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4425257Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4425893Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4426573Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4426700Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.4426776Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4426818Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4426856Z unimplemented [] 2025-12-04T09:58:55.4426916Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4427014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.4427610Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4427648Z graph_break [] 2025-12-04T09:58:55.4427721Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4427762Z Autotune Choices Stats: 2025-12-04T09:58:55.4428500Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.4428627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4428742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4428899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4429515Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4430125Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4430758Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4431360Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4431975Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4432573Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4433182Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4433790Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4434402Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4435024Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4435153Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.4435193Z Autotune Choices Stats: 2025-12-04T09:58:55.4435992Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.4436210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4436377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4436656Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4437287Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4437916Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4438558Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4439208Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4439835Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4440478Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4441105Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4441733Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4442359Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4442995Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4443134Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.4443216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4443261Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4443297Z unimplemented [] 2025-12-04T09:58:55.4443359Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4443457Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4444033Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4444069Z graph_break [] 2025-12-04T09:58:55.4444143Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4444182Z Autotune Choices Stats: 2025-12-04T09:58:55.4444939Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.4445067Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4445181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4445341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4445989Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4446605Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4447225Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4447845Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4448449Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4449063Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4449673Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4450276Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4450890Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4451493Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4451641Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.4451681Z Autotune Choices Stats: 2025-12-04T09:58:55.4452441Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.4452658Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4452833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4453109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4453747Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4454377Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4455002Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4455631Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4456336Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4456965Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4457603Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4458235Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4458863Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4459500Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4459640Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.4459733Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.4459780Z Traceback (most recent call last): 2025-12-04T09:58:55.4459931Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.4459972Z self.assertTrue( 2025-12-04T09:58:55.4460076Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.4460139Z raise 
self.failureException(msg) 2025-12-04T09:58:55.4460266Z AssertionError: False is not true : Log file /tmp/tmpguk9qo1r/flex_attention_configs.json was not created 2025-12-04T09:58:55.4460269Z 2025-12-04T09:58:55.4460346Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.4460509Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.4460513Z 2025-12-04T09:58:55.4460604Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.4460680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4460723Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4460759Z unimplemented [] 2025-12-04T09:58:55.4460822Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4461411Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.4461512Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4461549Z graph_break [] 2025-12-04T09:58:55.4461623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4462110Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.4462159Z current_size = base.storage().size() 2025-12-04T09:58:55.4462200Z Autotune Choices Stats: 2025-12-04T09:58:55.4462948Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.4463095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4463209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4463371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4463982Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4464607Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4465211Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4465822Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4466450Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4467050Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4467655Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4468269Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4468887Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4469490Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4469622Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.4469664Z Autotune Choices Stats: 2025-12-04T09:58:55.4470427Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.4470646Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4470811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4471085Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4471727Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4472361Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4473001Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4473623Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4474265Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4474890Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4475510Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4476176Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4476814Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4477461Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4477588Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.4477664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4477706Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4477744Z unimplemented [] 2025-12-04T09:58:55.4477804Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4477905Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4478497Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4478539Z graph_break [] 2025-12-04T09:58:55.4478611Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.4478652Z Autotune Choices Stats: 2025-12-04T09:58:55.4479385Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.4479511Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4479625Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4479784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4480404Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4481015Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4481624Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4482226Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4482841Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4483444Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4484044Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4484645Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4485258Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4485878Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4486047Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.4486087Z Autotune Choices Stats: 2025-12-04T09:58:55.4486849Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.4487065Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4487231Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4487506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4488136Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4488759Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4489395Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4490041Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4490666Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4491301Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4491927Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4492554Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4493177Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4493813Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4493955Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.4494049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4494092Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4494129Z unimplemented [] 2025-12-04T09:58:55.4494189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4494289Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4494863Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4494900Z graph_break [] 2025-12-04T09:58:55.4494975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4495014Z Autotune Choices Stats: 2025-12-04T09:58:55.4495755Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.4495882Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4496042Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4496204Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4496819Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4497442Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4498052Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4498659Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4499259Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4499873Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4500478Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4501081Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4501683Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4502296Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4502433Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.4502484Z Autotune Choices Stats: 2025-12-04T09:58:55.4503244Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.4503460Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.4503635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4503912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4504545Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4505174Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4505800Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4506464Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4507125Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4507756Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4508393Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4509021Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4509650Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4510275Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4510416Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.4510509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4510553Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4510591Z unimplemented [] 2025-12-04T09:58:55.4510652Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4510752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4511337Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4511374Z graph_break [] 2025-12-04T09:58:55.4511449Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4511489Z Autotune Choices Stats: 2025-12-04T09:58:55.4512238Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.4512368Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4512482Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4512642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4513255Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4513863Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4514479Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4515083Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4515695Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4516341Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4516955Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4517556Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4518153Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4518767Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4518909Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.4518949Z Autotune Choices Stats: 2025-12-04T09:58:55.4519723Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.4519940Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4520105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4520388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4521031Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4521656Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4522281Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4522908Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4523545Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4524193Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4524816Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4525457Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4526123Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4526750Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4526901Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.4526976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4527020Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4527056Z unimplemented [] 2025-12-04T09:58:55.4527117Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4527217Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4527805Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4527845Z graph_break [] 2025-12-04T09:58:55.4527932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4527974Z Autotune Choices Stats: 2025-12-04T09:58:55.4528714Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.4528843Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4528957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4529130Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4529746Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4530349Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.4530947Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4531551Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4532165Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4532769Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4533380Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4533983Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4534584Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4535188Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4535326Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.4535366Z Autotune Choices Stats: 2025-12-04T09:58:55.4536172Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.4536404Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4536581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4536858Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4537492Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4538129Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4538754Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4539376Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4540016Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4540652Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4541291Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4541930Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4542556Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4543186Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4543314Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.4543389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4543432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4543468Z unimplemented [] 2025-12-04T09:58:55.4543528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4543626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4544212Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4544259Z graph_break [] 2025-12-04T09:58:55.4544332Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4544373Z Autotune Choices Stats: 2025-12-04T09:58:55.4545119Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.4545248Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4545364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4545526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4546195Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4546793Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4547401Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4548006Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4548614Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4549237Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4549843Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4550453Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4551054Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4551661Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4551790Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.4551831Z Autotune Choices Stats: 2025-12-04T09:58:55.4552587Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.4552812Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4552988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4553265Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4553914Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4554546Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4555169Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4555789Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4556458Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4557106Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4557844Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4558477Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4559120Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4559743Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4559872Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.4559949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4559990Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4560029Z unimplemented [] 2025-12-04T09:58:55.4560089Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4560192Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4560773Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4560821Z graph_break [] 2025-12-04T09:58:55.4560899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4560940Z Autotune Choices Stats: 2025-12-04T09:58:55.4561670Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.4561807Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4561933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4562092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4562704Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4563318Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4563919Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4564520Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4565128Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4565743Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4566405Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4567007Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4567624Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4568225Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4568354Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.4568402Z Autotune Choices Stats: 2025-12-04T09:58:55.4569160Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.4569378Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4569559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4569835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4570487Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4571112Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4571746Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4572373Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4573002Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4573628Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4574262Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4574917Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4575542Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4576227Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4576357Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.4576435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4576476Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4576514Z unimplemented [] 2025-12-04T09:58:55.4576574Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4576677Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4577256Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4577293Z graph_break [] 2025-12-04T09:58:55.4577366Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4577410Z Autotune Choices Stats: 2025-12-04T09:58:55.4578152Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.4578291Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4578419Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4578580Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4579204Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4579806Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4580417Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4581024Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4581628Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4582230Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4582843Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4583466Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4584066Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4584679Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4584808Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.4584849Z Autotune Choices Stats: 2025-12-04T09:58:55.4585614Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4585831Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4586034Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4586309Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.4586955Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4587609Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4588231Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4588863Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4589493Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4590121Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4590746Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4591395Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4592044Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4592669Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4592797Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.4592875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4592917Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4592965Z unimplemented [] 2025-12-04T09:58:55.4593027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4593128Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4593704Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4593742Z graph_break [] 2025-12-04T09:58:55.4593818Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4593857Z Autotune Choices Stats: 2025-12-04T09:58:55.4594599Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.4594734Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4594850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4595013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4595623Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4596293Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4596897Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4597514Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4598123Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4598731Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4599337Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4599951Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4600578Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4601175Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4601306Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.4601347Z Autotune Choices Stats: 2025-12-04T09:58:55.4602118Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.4602333Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4602503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4602781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4603415Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4604047Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4604692Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4605322Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4605988Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4606613Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4607246Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4607872Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4608515Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4609164Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4609296Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.4609371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4609414Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4609452Z unimplemented [] 2025-12-04T09:58:55.4609515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4609614Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4610202Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4610239Z graph_break [] 2025-12-04T09:58:55.4610316Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4610355Z Autotune Choices Stats: 2025-12-04T09:58:55.4611098Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.4611226Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.4611340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4611501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4612118Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4612731Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4613350Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4613956Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4614572Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4615176Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4615776Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4616418Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4617038Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4617669Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4617799Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.4617839Z Autotune Choices Stats: 2025-12-04T09:58:55.4618609Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.4618825Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4618995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4619272Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4619904Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4620529Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4621167Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4621820Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4622445Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4623086Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4623711Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4624342Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4624973Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4625616Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4625756Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.4625844Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.4625887Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4625985Z unimplemented [] 2025-12-04T09:58:55.4626047Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4626145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4626723Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4626762Z graph_break [] 2025-12-04T09:58:55.4626837Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4626876Z Autotune Choices Stats: 2025-12-04T09:58:55.4627628Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.4627756Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4627871Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4628031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4628640Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4629255Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4629875Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.4630489Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4631093Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4631711Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4632321Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4632922Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4633525Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4634137Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4634276Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.4634317Z Autotune Choices Stats: 2025-12-04T09:58:55.4635088Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.4635306Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4635483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4635761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4636427Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4637050Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4637672Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4638308Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4638971Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4639597Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4640231Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4640870Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4641497Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4642122Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4642261Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.4642346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4642388Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.4642425Z unimplemented [] 2025-12-04T09:58:55.4642486Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4642584Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4643173Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4643211Z graph_break [] 2025-12-04T09:58:55.4643286Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4643329Z Autotune Choices Stats: 2025-12-04T09:58:55.4644090Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.4644218Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4644332Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4644491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4645107Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4645715Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4646362Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4646977Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4647601Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4648204Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4648835Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4649442Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4650047Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4650665Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4650795Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.4650847Z Autotune Choices Stats: 2025-12-04T09:58:55.4651616Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.4651836Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4652002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4652279Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4652923Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4653548Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4654174Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4654798Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4655440Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4656136Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4656763Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4657404Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4658029Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4658657Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4658799Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.4658875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4658920Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4658963Z unimplemented [] 
2025-12-04T09:58:55.4659024Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4659125Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4659701Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4659752Z graph_break [] 2025-12-04T09:58:55.4659825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4659878Z Autotune Choices Stats: 2025-12-04T09:58:55.4660608Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.4660735Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4660851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4661018Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4661628Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4662234Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4662834Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4663451Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4664068Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4664673Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4665292Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4665893Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4666533Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4667130Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4667275Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.4667315Z Autotune Choices Stats: 2025-12-04T09:58:55.4668075Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.4668312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4668490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4668766Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4669400Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4670046Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4670672Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4671294Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4671937Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4672572Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4673205Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4673831Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4674470Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4675093Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4675222Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.4675298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4675340Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4675377Z unimplemented [] 2025-12-04T09:58:55.4675436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.4675537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4676144Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4676195Z graph_break [] 2025-12-04T09:58:55.4676268Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4676309Z Autotune Choices Stats: 2025-12-04T09:58:55.4677063Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.4677190Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4677305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4677463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4678085Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4678696Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4679301Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4679897Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4680512Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4681144Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4681743Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4682355Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4682959Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4683568Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4683695Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.4683736Z Autotune Choices Stats: 
2025-12-04T09:58:55.4684491Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.4684719Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4684896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4685176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4685822Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4686481Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4687120Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4687747Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4688373Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4689016Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4689666Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4690293Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4690930Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4691554Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4691682Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.4691758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4691800Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4691838Z unimplemented [] 2025-12-04T09:58:55.4691899Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4691999Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4692569Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4692618Z graph_break [] 2025-12-04T09:58:55.4692691Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4692733Z Autotune Choices Stats: 2025-12-04T09:58:55.4693472Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.4693609Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4693735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4693896Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.4694511Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4695124Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4695728Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4696371Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4696977Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4697594Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4698211Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4698820Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4699431Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4700033Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4700162Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.4700202Z Autotune Choices Stats: 2025-12-04T09:58:55.4700964Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.4701182Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4701358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4701630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4702286Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4702912Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4703545Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4704169Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4704799Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4705431Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4706098Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4706749Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4707379Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4708014Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4708144Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.4708219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4708260Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4708303Z unimplemented [] 2025-12-04T09:58:55.4708363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4708462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4709040Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4709076Z graph_break [] 2025-12-04T09:58:55.4709152Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4709192Z Autotune Choices Stats: 2025-12-04T09:58:55.4709941Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.4710078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4710203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4710365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4710985Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4711586Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4712206Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4712805Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4713412Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4714019Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4714631Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4715245Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4715840Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4716572Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4716700Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.4716742Z Autotune Choices Stats: 2025-12-04T09:58:55.4717496Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.4717712Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4717876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4718151Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4718798Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4719439Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4720064Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4720697Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4721328Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4721958Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4722579Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4723220Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4723868Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4724492Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4724621Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.4724697Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4724740Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4724776Z unimplemented [] 2025-12-04T09:58:55.4724847Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4724948Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4725528Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4725565Z graph_break [] 2025-12-04T09:58:55.4725640Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4725680Z Autotune Choices Stats: 2025-12-04T09:58:55.4726451Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.4726578Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4726706Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4726867Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4727485Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4728114Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4728717Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4729335Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4729941Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4730545Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4731148Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4731758Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4732386Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4732989Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4733120Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.4733161Z Autotune Choices Stats: 2025-12-04T09:58:55.4733930Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.4734147Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4734318Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4735511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4736188Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4736835Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4737488Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4738111Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4738743Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4739369Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4740033Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4740661Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4741303Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4741952Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4742082Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.4742161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4742203Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4742244Z unimplemented [] 2025-12-04T09:58:55.4742304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4742404Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4742985Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4743022Z graph_break [] 2025-12-04T09:58:55.4743101Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4743142Z Autotune Choices Stats: 2025-12-04T09:58:55.4743894Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.4744025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4744160Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4744321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4744937Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4745551Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4746227Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4746834Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4747440Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4748043Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4748672Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4749280Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4749890Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4750514Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4750645Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.4750686Z Autotune Choices Stats: 2025-12-04T09:58:55.4751444Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.4751662Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4751830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4752108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4752748Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4753381Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4754019Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4754664Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4755300Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4755963Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4756591Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4757241Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.4757870Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4758513Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4758655Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.4758730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4758784Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4758821Z unimplemented [] 2025-12-04T09:58:55.4758883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4758983Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4759560Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4759598Z graph_break [] 2025-12-04T09:58:55.4759672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4759713Z Autotune Choices Stats: 2025-12-04T09:58:55.4760458Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.4760588Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4760703Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4760867Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4761487Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4762100Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4762722Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4763343Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.4763950Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4764559Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4765164Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4765784Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4766425Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4767062Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4767210Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.4767249Z Autotune Choices Stats: 2025-12-04T09:58:55.4768030Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.4768249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4768418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4768697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4769331Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4769972Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4770597Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4771235Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4771883Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4772515Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4773144Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4773776Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4774416Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4775045Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4775188Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.4775276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4775321Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4775358Z unimplemented [] 2025-12-04T09:58:55.4775421Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4775521Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4776153Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4776192Z graph_break [] 2025-12-04T09:58:55.4776270Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4776310Z Autotune Choices Stats: 2025-12-04T09:58:55.4777048Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.4777176Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4777290Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4777455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4778071Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4778697Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4779312Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4779942Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4780560Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4781168Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4781772Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4782379Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4782999Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4783599Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4783740Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.4783789Z Autotune Choices Stats: 2025-12-04T09:58:55.4784567Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.4784784Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4784951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4785228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4785858Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4786524Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4787169Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4787798Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4788440Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4789100Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4789726Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4790358Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4790992Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4791631Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4791760Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.4791844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4791888Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4791926Z unimplemented [] 2025-12-04T09:58:55.4791989Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4792089Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4792662Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4792713Z graph_break [] 2025-12-04T09:58:55.4792786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4792838Z Autotune Choices Stats: 2025-12-04T09:58:55.4793587Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.4793715Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4793831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4793995Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4794611Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4795216Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4795838Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4796485Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4797109Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4797739Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4798350Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4798956Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4799557Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4800179Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4800321Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.4800362Z Autotune Choices Stats: 2025-12-04T09:58:55.4801123Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.4801350Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4801528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4801810Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4802458Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4803087Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4803717Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4804362Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4804995Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4805637Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4806338Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4806971Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4807600Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4808225Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4808355Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.4808450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4808492Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4808531Z unimplemented [] 2025-12-04T09:58:55.4808591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4808691Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4809283Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4809322Z graph_break [] 2025-12-04T09:58:55.4809409Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4809451Z Autotune Choices Stats: 2025-12-04T09:58:55.4810204Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.4810334Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4810453Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4810614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4811232Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4811840Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4812443Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4813060Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4813680Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4814313Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4814920Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4815530Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4816169Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4816777Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4816907Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.4816966Z Autotune Choices Stats: 2025-12-04T09:58:55.4817727Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.4817963Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4818145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4818421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4819071Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4819698Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4820323Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4820950Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4821593Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4822249Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4822893Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4823538Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4824173Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4824804Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4824932Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.4825007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4825049Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4825088Z unimplemented [] 2025-12-04T09:58:55.4825147Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4825249Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4825844Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4825896Z graph_break [] 
2025-12-04T09:58:55.4826013Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4826055Z Autotune Choices Stats: 2025-12-04T09:58:55.4826807Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.4826957Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4827074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4827248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4827859Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4828464Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4829076Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4829688Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4830406Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4831073Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4831781Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4832408Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4833019Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4833715Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4833916Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.4833957Z Autotune Choices Stats: 2025-12-04T09:58:55.4834737Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.4834955Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4835135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4835414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4836125Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4836754Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4837383Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4838010Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4838647Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4839299Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4839936Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4840590Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4841224Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4841857Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4841990Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.4842068Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4842112Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4842151Z unimplemented [] 2025-12-04T09:58:55.4842212Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4842312Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4842900Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4842936Z graph_break [] 2025-12-04T09:58:55.4843024Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.4843063Z Autotune Choices Stats: 2025-12-04T09:58:55.4843806Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.4843946Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4844072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4844236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4844857Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4845468Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4846124Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4846727Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4847365Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4847974Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4848590Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4849212Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4849819Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4850423Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4850553Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.4850594Z Autotune Choices Stats: 2025-12-04T09:58:55.4851356Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.4851573Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4851751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4852028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4852675Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4853321Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4853944Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4854575Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4855205Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4855847Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4856513Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4857159Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4857812Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4858441Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4858571Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.4858646Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4858691Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4858730Z unimplemented [] 2025-12-04T09:58:55.4858791Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4858889Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4859467Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4859505Z graph_break [] 2025-12-04T09:58:55.4859580Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4859620Z Autotune Choices Stats: 
2025-12-04T09:58:55.4860391Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.4860520Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4860646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4860809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4861425Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4862071Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4862679Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4863289Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4863898Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4864508Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4865113Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4865733Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4866407Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4867006Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4867137Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.4867177Z Autotune Choices Stats: 2025-12-04T09:58:55.4867945Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.4868163Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4868328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4868617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4869251Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4869897Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4870545Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4871174Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4871808Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4872443Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4873082Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4873712Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4874354Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4875003Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4875133Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.4875210Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4875254Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4875292Z unimplemented [] 2025-12-04T09:58:55.4875353Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4875452Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4876077Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4876117Z graph_break [] 2025-12-04T09:58:55.4876194Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4876235Z Autotune Choices Stats: 2025-12-04T09:58:55.4876983Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.4877113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4877252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4877414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4878029Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4878651Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4879285Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4879894Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4880499Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.4881107Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4881725Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4882327Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4882939Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4883564Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4883694Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.4883735Z Autotune Choices Stats: 2025-12-04T09:58:55.4884494Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.4884712Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.4884878Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4885155Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4885799Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4886466Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4889479Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4890146Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4890776Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4891406Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4892026Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4892670Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4893296Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4893930Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4894073Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.4894162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4894210Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4894248Z unimplemented [] 2025-12-04T09:58:55.4894311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4894414Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4894985Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4895025Z graph_break [] 2025-12-04T09:58:55.4895102Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4895144Z Autotune Choices Stats: 2025-12-04T09:58:55.4895885Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.4896052Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4896170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4896331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4896959Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4897580Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4898199Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4898813Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4899414Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4900014Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4900615Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4901228Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4901830Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4902441Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4902582Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.4902636Z Autotune Choices Stats: 2025-12-04T09:58:55.4903399Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.4903618Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4903786Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4904063Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4904698Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4905331Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4905988Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4906624Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4907275Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4907903Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4908527Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4909153Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4909792Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4910422Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4910560Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.4910648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4910691Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4910728Z unimplemented [] 2025-12-04T09:58:55.4910790Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4910891Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4911480Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4911520Z graph_break [] 2025-12-04T09:58:55.4911594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4911636Z Autotune Choices Stats: 2025-12-04T09:58:55.4912375Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.4912503Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4912623Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4912794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4913407Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4914033Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4914647Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4915256Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4915866Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4916508Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4917120Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4917724Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4918338Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4918950Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4919091Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.4919131Z Autotune Choices Stats: 2025-12-04T09:58:55.4919893Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4920112Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4920281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4920557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4921196Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4921821Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4922450Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4923072Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4923711Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4924354Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4924976Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4925610Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4926285Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4926922Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4927065Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.4927140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4927184Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4927224Z unimplemented [] 2025-12-04T09:58:55.4927284Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4927384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4927973Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4928012Z graph_break [] 2025-12-04T09:58:55.4928101Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4928142Z Autotune Choices Stats: 2025-12-04T09:58:55.4928877Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.4929004Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4929122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4929283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4929896Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4930503Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.4931116Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4931728Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4932352Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4932958Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4933563Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4934167Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4934777Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4935387Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4935525Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.4935566Z Autotune Choices Stats: 2025-12-04T09:58:55.4936375Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4936604Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4936785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4937061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4937693Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4938311Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4938937Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4939573Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4940213Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4940859Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4941492Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4942114Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4942740Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4943368Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4943508Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.4943583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4943626Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4943663Z unimplemented [] 2025-12-04T09:58:55.4943723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4943834Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4944407Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4944460Z graph_break [] 2025-12-04T09:58:55.4944535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4944575Z Autotune Choices Stats: 2025-12-04T09:58:55.4945315Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.4945443Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4945559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4945719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4946350Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4946952Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4947576Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4948178Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4948789Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4949428Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4950033Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4950634Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4951237Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4951843Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4951980Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.4952020Z Autotune Choices Stats: 2025-12-04T09:58:55.4952775Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.4953002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4953178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4953467Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4954095Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4954724Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4955348Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.4956014Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4956659Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4957297Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4957942Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4958566Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4959193Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4959815Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4959949Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.4960028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4960076Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4960117Z unimplemented [] 2025-12-04T09:58:55.4960182Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4960285Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4960869Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.4960924Z graph_break [] 2025-12-04T09:58:55.4961002Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4961046Z Autotune Choices Stats: 2025-12-04T09:58:55.4961793Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.4961933Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4962056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4962217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4962826Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4963434Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4964039Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4964650Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4965251Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4965868Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4966535Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4967131Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4967735Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4968338Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4968466Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.4968506Z Autotune Choices Stats: 2025-12-04T09:58:55.4969274Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.4969503Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4969668Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4969949Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4970601Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4971226Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4971852Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.4972474Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4973109Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4973728Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4974362Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4975010Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4975633Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4976296Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4976426Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.4976501Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4976544Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4976580Z unimplemented [] 2025-12-04T09:58:55.4976642Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4976741Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4977324Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4977361Z graph_break [] 2025-12-04T09:58:55.4977434Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4977474Z Autotune Choices Stats: 2025-12-04T09:58:55.4978212Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.4978359Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4978484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4978643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4979270Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4979864Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4980464Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4981073Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4981689Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4982284Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4982900Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4983523Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4984124Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4984724Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4984853Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.4984893Z Autotune Choices Stats: 2025-12-04T09:58:55.4985646Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.4985877Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4986078Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4986376Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4987007Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4987668Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4988294Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4988915Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4989541Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4990185Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4990804Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4991440Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4992092Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4992716Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4992845Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.4992919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.4992961Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.4992997Z unimplemented [] 2025-12-04T09:58:55.4993058Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.4993157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.4993729Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.4993767Z graph_break [] 2025-12-04T09:58:55.4993841Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.4993881Z Autotune Choices Stats: 2025-12-04T09:58:55.4994637Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.4994774Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.4994888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.4995047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.4995657Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4996323Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4996930Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4997531Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4998131Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4998746Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.4999351Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.4999961Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5000587Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5001193Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5001322Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.5001361Z Autotune Choices Stats: 2025-12-04T09:58:55.5002117Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5002336Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5002501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5002788Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.5003424Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5004058Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5004700Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5005327Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5005994Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5006616Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5007256Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5007886Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5008521Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5009180Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5009310Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.5009385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5009426Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5009463Z unimplemented [] 2025-12-04T09:58:55.5009523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5009624Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5010199Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5010237Z graph_break [] 2025-12-04T09:58:55.5010310Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5010350Z Autotune Choices Stats: 2025-12-04T09:58:55.5011088Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.5011226Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5011339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5011498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5012116Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5012723Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5013331Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5013926Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5014531Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5015130Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5015755Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5016391Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5016995Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5017616Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5017744Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.5017786Z Autotune Choices Stats: 2025-12-04T09:58:55.5018535Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.5018755Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5018923Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5019199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5019840Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5020464Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5021103Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5021744Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5022370Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5022999Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5023626Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5024258Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5024899Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5025535Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5025673Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.5025747Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5025789Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5025826Z unimplemented [] 2025-12-04T09:58:55.5025886Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5026005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5026577Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5026619Z graph_break [] 2025-12-04T09:58:55.5026691Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5026733Z Autotune Choices Stats: 2025-12-04T09:58:55.5027472Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.5027600Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5027714Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5027873Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5028502Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5029112Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5029747Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5030350Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5030958Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5031565Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5032171Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5032780Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5033391Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5034002Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5034141Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.5034183Z Autotune Choices Stats: 2025-12-04T09:58:55.5034941Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.5035160Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5035324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5035600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5036275Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5036912Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5037546Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5038185Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5038829Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5039455Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5040079Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5040710Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5041361Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5041992Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5042134Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.5042208Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5042249Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5042288Z unimplemented [] 2025-12-04T09:58:55.5042350Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5042448Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5043034Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5043072Z graph_break [] 2025-12-04T09:58:55.5043145Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5043184Z Autotune Choices Stats: 2025-12-04T09:58:55.5043924Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.5044051Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5044163Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5044326Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5044935Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5045553Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5046203Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5046831Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5047432Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5048033Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5048636Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5049237Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5049856Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5050467Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5050608Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.5050648Z Autotune Choices Stats: 2025-12-04T09:58:55.5051422Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.5051639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5051803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5052084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5052714Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5053332Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5053962Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5054598Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5055248Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5055872Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5056537Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5057167Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5057792Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5058427Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5058569Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.5058643Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.5058686Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5058722Z unimplemented [] 2025-12-04T09:58:55.5058783Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5058894Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5059479Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5059516Z graph_break [] 2025-12-04T09:58:55.5059590Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5059630Z Autotune Choices Stats: 2025-12-04T09:58:55.5060370Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.5060498Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5060610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5060772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5061388Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5062003Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5062605Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.5063224Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5063835Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5064437Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5065041Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5065650Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5066282Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5066896Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5067036Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.5067075Z Autotune Choices Stats: 2025-12-04T09:58:55.5067842Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.5068083Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5068247Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5068526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5069159Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5069791Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5070413Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5071046Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5071681Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5072327Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5072948Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5073573Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5074201Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5074832Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5074962Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.5075036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5075080Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.5075127Z unimplemented [] 2025-12-04T09:58:55.5075187Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5075287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5075860Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5075907Z graph_break [] 2025-12-04T09:58:55.5076020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5076059Z Autotune Choices Stats: 2025-12-04T09:58:55.5076811Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.5076940Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5077054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5077214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5077826Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5078433Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5079048Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5079651Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5080268Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5080894Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5081497Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5082101Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5082706Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5083321Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5083451Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.5083490Z Autotune Choices Stats: 2025-12-04T09:58:55.5084250Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.5084487Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5084651Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5084939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5085576Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5086239Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5086861Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5087500Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5088127Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5088768Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5089417Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5090051Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5090677Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5091300Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5091430Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.5091505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5091548Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5091584Z unimplemented [] 
2025-12-04T09:58:55.5091645Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5091753Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5092328Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5092378Z graph_break [] 2025-12-04T09:58:55.5092450Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5092490Z Autotune Choices Stats: 2025-12-04T09:58:55.5093230Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.5093379Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5093492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5093651Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5094268Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5094868Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5095471Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5096121Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5096726Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5097343Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5097963Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5098570Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5099174Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5099774Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5099902Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.5099942Z Autotune Choices Stats: 2025-12-04T09:58:55.5100711Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.5100940Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5101103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5101388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5102025Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5102649Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5103276Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5103902Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5104554Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5105181Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5105815Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5106503Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5107130Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5107758Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5107885Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.5107960Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5108002Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5108041Z unimplemented [] 2025-12-04T09:58:55.5108101Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.5108203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5108788Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5108826Z graph_break [] 2025-12-04T09:58:55.5108899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5108939Z Autotune Choices Stats: 2025-12-04T09:58:55.5109692Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.5109831Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5109945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5110105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5110728Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5111331Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5111940Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5112536Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5113154Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5113751Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5114362Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5114981Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5115582Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5116220Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5116350Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.5116391Z Autotune Choices Stats: 
2025-12-04T09:58:55.5117150Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.5117391Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5117555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5117844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5118478Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5119134Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5119756Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5120382Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5121018Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5121653Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5122274Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5122914Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5123559Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5124182Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5124310Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.5124384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5124426Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5124464Z unimplemented [] 2025-12-04T09:58:55.5124526Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5124626Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5125199Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5125236Z graph_break [] 2025-12-04T09:58:55.5125309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5125350Z Autotune Choices Stats: 2025-12-04T09:58:55.5126149Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.5126290Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5126405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5126564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.5127183Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5127800Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5128403Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5129008Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5129616Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5130237Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5130846Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5131456Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5132072Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5132672Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5132800Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.5132842Z Autotune Choices Stats: 2025-12-04T09:58:55.5133596Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.5133812Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5133992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5134269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5134899Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5135536Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5136246Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5136868Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5137503Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5138134Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5138785Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5139423Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5140072Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5140709Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5140836Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.5140911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5140953Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5140991Z unimplemented [] 2025-12-04T09:58:55.5141051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5141151Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5141732Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5141770Z graph_break [] 2025-12-04T09:58:55.5141846Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5141886Z Autotune Choices Stats: 2025-12-04T09:58:55.5142642Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.5142770Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5142884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5143058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5143666Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5144283Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5144885Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5145488Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5146128Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5146729Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5147352Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5147968Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5148600Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5149206Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5149333Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.5149377Z Autotune Choices Stats: 2025-12-04T09:58:55.5150139Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.5150354Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5150518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5150794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5151439Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5152085Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5152718Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5153352Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5153982Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5154610Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5155230Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5155866Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5156549Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5157215Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5157344Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.5157420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5157464Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5157502Z unimplemented [] 2025-12-04T09:58:55.5157564Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5157663Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.5158237Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5158275Z graph_break [] 2025-12-04T09:58:55.5158351Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5158392Z Autotune Choices Stats: 2025-12-04T09:58:55.5159134Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.5159263Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5159375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5159549Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5160154Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5160768Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5161390Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5161990Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5162590Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5163199Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5163804Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5164424Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5165042Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5165665Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5165797Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.5165838Z Autotune Choices Stats: 2025-12-04T09:58:55.5166630Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.5166846Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5167011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5167292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5167928Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5168574Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5169211Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5169868Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5170492Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5171115Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5171740Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5172369Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5173004Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5173637Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5173774Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.5173850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5173895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5173931Z unimplemented [] 2025-12-04T09:58:55.5173992Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5174101Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5174677Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5174715Z graph_break [] 2025-12-04T09:58:55.5174787Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5174828Z Autotune Choices Stats: 2025-12-04T09:58:55.5175571Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.5175700Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5175814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5176004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5176630Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5177234Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5177847Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5178481Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5179084Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5179688Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5180296Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5180912Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5181510Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5182119Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5182257Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.5182296Z Autotune Choices Stats: 2025-12-04T09:58:55.5183082Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5183301Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5183463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5183744Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5184369Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5185000Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5185639Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5186297Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5186951Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5187586Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5188207Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5188834Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5189475Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5190102Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5190242Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.5190314Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5190357Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5190404Z unimplemented [] 2025-12-04T09:58:55.5190468Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5190566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5191147Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5191189Z graph_break [] 2025-12-04T09:58:55.5191262Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5191302Z Autotune Choices Stats: 2025-12-04T09:58:55.5192049Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2030", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.5192179Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5192292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5192457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5193062Z triton_flex_attention_2030 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5193669Z triton_flex_attention_2031 0.0108 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5194275Z triton_flex_attention_2026 0.0112 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5194888Z triton_flex_attention_2028 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5195508Z triton_flex_attention_2029 0.0116 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5196150Z triton_flex_attention_2046 0.0132 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5196753Z triton_flex_attention_2027 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5197352Z triton_flex_attention_2038 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5197964Z triton_flex_attention_2044 0.0144 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5198568Z triton_flex_attention_2024 0.0147 ms 63.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5198707Z SingleProcess AUTOTUNE benchmarking takes 0.1936 seconds and 0.4021 seconds precompiling for 24 choices 2025-12-04T09:58:55.5198746Z Autotune Choices Stats: 2025-12-04T09:58:55.5199527Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2065", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5199755Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5199919Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5200194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5200835Z triton_flex_attention_backward_2065 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5201459Z triton_flex_attention_backward_2059 0.0182 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5202095Z triton_flex_attention_backward_2056 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5202723Z triton_flex_attention_backward_2057 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5203364Z triton_flex_attention_backward_2066 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5204011Z triton_flex_attention_backward_2067 0.0200 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5204635Z triton_flex_attention_backward_2064 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5205264Z triton_flex_attention_backward_2069 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5205892Z triton_flex_attention_backward_2060 0.0224 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5206554Z triton_flex_attention_backward_2051 0.0230 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5206685Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.8209 seconds precompiling for 22 choices 2025-12-04T09:58:55.5206780Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.5206841Z Traceback (most recent call last): 2025-12-04T09:58:55.5206994Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.5207033Z self.assertTrue( 2025-12-04T09:58:55.5207141Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.5207189Z raise self.failureException(msg) 2025-12-04T09:58:55.5207315Z AssertionError: False is not 
true : Log file /tmp/tmpr79ss940/flex_attention_configs.json was not created 2025-12-04T09:58:55.5207340Z 2025-12-04T09:58:55.5207417Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.5207582Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.5207585Z 2025-12-04T09:58:55.5207675Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.5207751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5207793Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5207854Z unimplemented [] 2025-12-04T09:58:55.5207916Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5208496Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.5208597Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5208636Z graph_break [] 2025-12-04T09:58:55.5208710Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5209200Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.5209250Z current_size = base.storage().size() 2025-12-04T09:58:55.5209293Z Autotune Choices Stats: 2025-12-04T09:58:55.5210044Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.5210171Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5210286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5210459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5211070Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5211684Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5212305Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5212906Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5213504Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5214106Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5214714Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5215311Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5215980Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5216613Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5216741Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.5216784Z Autotune Choices Stats: 2025-12-04T09:58:55.5217548Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.5217764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5217928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5218207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5218838Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5219481Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5220116Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5220752Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5221379Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5222003Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5222622Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5223259Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5223882Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5224514Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5224652Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.5224727Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5224770Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5224808Z unimplemented [] 2025-12-04T09:58:55.5224879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5224980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5225561Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5225598Z graph_break [] 2025-12-04T09:58:55.5225674Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.5225716Z Autotune Choices Stats: 2025-12-04T09:58:55.5226490Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.5226616Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5226728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5226891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5227513Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5228116Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5228723Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5229344Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5229950Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5230551Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5231143Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5231756Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5232361Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5232970Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5233110Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.5233150Z Autotune Choices Stats: 2025-12-04T09:58:55.5233921Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5234139Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5234305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5234583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5235209Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5235833Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5236522Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5237151Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5237799Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5238426Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5239053Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5239673Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5240309Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5240939Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5241078Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.5241153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5241196Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5241244Z unimplemented [] 2025-12-04T09:58:55.5241310Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5241408Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5241991Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5242030Z graph_break [] 2025-12-04T09:58:55.5242107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5242147Z Autotune Choices Stats: 2025-12-04T09:58:55.5242890Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.5243018Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5243132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5243294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5243900Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5244509Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5245109Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5245721Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5246398Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5247000Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5247607Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5248208Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5248815Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5249415Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5249554Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.5249593Z Autotune Choices Stats: 2025-12-04T09:58:55.5250356Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.5250594Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.5250760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5251036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5251668Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5252294Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5252927Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5253549Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5254188Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5254832Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5255456Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5256116Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5256743Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5257378Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5257509Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.5257582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5257639Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5257677Z unimplemented [] 2025-12-04T09:58:55.5257739Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5257838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5258414Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5258463Z graph_break [] 2025-12-04T09:58:55.5258538Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5258579Z Autotune Choices Stats: 2025-12-04T09:58:55.5259333Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.5259461Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5259572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5259733Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5260346Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5260950Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5261559Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5262159Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5262774Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5263401Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5264001Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5264603Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5265203Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5265815Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5265978Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.5266017Z Autotune Choices Stats: 2025-12-04T09:58:55.5266788Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.5267015Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5267179Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5267471Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5268099Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5268728Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5269353Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5269988Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5270613Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5271246Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5271891Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5272517Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5273146Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5273768Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5273901Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.5273974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5274016Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5274054Z unimplemented [] 2025-12-04T09:58:55.5274124Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5274222Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5274795Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5274843Z graph_break [] 2025-12-04T09:58:55.5274916Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5274956Z Autotune Choices Stats: 2025-12-04T09:58:55.5275705Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.5275842Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5275986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5276147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5276757Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5277372Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.5277970Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5278585Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5279180Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5279791Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5280422Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5281021Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5281623Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5282220Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5282351Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.5282390Z Autotune Choices Stats: 2025-12-04T09:58:55.5283161Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.5283388Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5283554Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5283843Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5284484Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5285109Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5285725Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5286389Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5287029Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5287654Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5288294Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5288933Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5289558Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5290179Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5290308Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.5290382Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5290426Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5290465Z unimplemented [] 2025-12-04T09:58:55.5290525Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5290627Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5291209Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5291249Z graph_break [] 2025-12-04T09:58:55.5291322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5291372Z Autotune Choices Stats: 2025-12-04T09:58:55.5292112Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.5292251Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5292364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5292525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5293143Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5293750Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5294354Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5294958Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5295578Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5296232Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5296848Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5297462Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5298062Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5298663Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5298793Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.5298835Z Autotune Choices Stats: 2025-12-04T09:58:55.5299602Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.5299820Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5299984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5300275Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5300911Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5301553Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5302174Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5302800Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5303425Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5304064Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5304686Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5305331Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5306011Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5306635Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5306764Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.5306840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5306882Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5306920Z unimplemented [] 2025-12-04T09:58:55.5306982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5307083Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5307658Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5307697Z graph_break [] 2025-12-04T09:58:55.5307771Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5307814Z Autotune Choices Stats: 2025-12-04T09:58:55.5308571Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.5308711Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5308823Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5308984Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5309623Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5310227Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5310828Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5311428Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5312037Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5312646Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5313255Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5313877Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5314493Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5315094Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5315221Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.5315264Z Autotune Choices Stats: 2025-12-04T09:58:55.5316051Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.5316272Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5316450Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5316724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5317358Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5317998Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5318648Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5319266Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5319894Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5320526Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5321156Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5321794Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5322432Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5323070Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5323198Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.5323271Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5323313Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5323352Z unimplemented [] 2025-12-04T09:58:55.5323413Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5323517Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5324085Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5324124Z graph_break [] 2025-12-04T09:58:55.5324198Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5324238Z Autotune Choices Stats: 2025-12-04T09:58:55.5324994Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.5325121Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5325234Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5325405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5326052Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5326704Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5327308Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5327908Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5328507Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5329104Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5329729Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5330343Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5330960Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5331564Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5331696Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.5331741Z Autotune Choices Stats: 2025-12-04T09:58:55.5332493Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.5332708Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5332874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5333148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.5333785Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5334412Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5335034Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5335666Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5336334Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5336959Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5337580Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5338233Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5338869Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5339514Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5339642Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.5339717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5339759Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5339800Z unimplemented [] 2025-12-04T09:58:55.5339860Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5339964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5340542Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5340579Z graph_break [] 2025-12-04T09:58:55.5340655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5340695Z Autotune Choices Stats: 2025-12-04T09:58:55.5341436Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.5341565Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5341677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5341854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5342464Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5343085Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5343708Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5344308Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5344914Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5345517Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5346160Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5346779Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5347395Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5348017Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5348147Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.5348186Z Autotune Choices Stats: 2025-12-04T09:58:55.5348946Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.5349166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5349333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5349614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5350243Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5350876Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5351510Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5352152Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5352776Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5353399Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5354024Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5354641Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5355276Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5355909Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5356085Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.5356158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5356202Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5356238Z unimplemented [] 2025-12-04T09:58:55.5356301Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5356410Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5356986Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5357022Z graph_break [] 2025-12-04T09:58:55.5357098Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5357139Z Autotune Choices Stats: 2025-12-04T09:58:55.5357879Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.5358010Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.5358124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5358286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5358918Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5359514Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5360129Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5360756Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5361355Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5361958Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5362561Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5363179Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5363782Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5364387Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5364528Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.5364568Z Autotune Choices Stats: 2025-12-04T09:58:55.5365337Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.5365557Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5365720Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5366043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5366666Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5367289Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5367922Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5368556Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5369206Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5369828Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5370451Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5371077Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5371712Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5372339Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5372477Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.5372551Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.5372594Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5372630Z unimplemented [] 2025-12-04T09:58:55.5372703Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5372803Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5373389Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5373427Z graph_break [] 2025-12-04T09:58:55.5373502Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5373541Z Autotune Choices Stats: 2025-12-04T09:58:55.5374282Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.5374410Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5374522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5374685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5375293Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5375909Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5376551Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.5377161Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5377784Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5378382Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5378987Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5379587Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5380199Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5380797Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5380934Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.5380973Z Autotune Choices Stats: 2025-12-04T09:58:55.5381723Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.5381959Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5382122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5382397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5383031Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5383653Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5384287Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5384915Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5385552Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5386232Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5386854Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5387485Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5388108Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5388742Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5388870Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.5388943Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5388999Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.5389035Z unimplemented [] 2025-12-04T09:58:55.5389097Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5389196Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5389768Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5389829Z graph_break [] 2025-12-04T09:58:55.5389903Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5389943Z Autotune Choices Stats: 2025-12-04T09:58:55.5390693Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.5390820Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5390932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5391095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5393343Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5393956Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5394572Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5395174Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5395787Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5396452Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5397056Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5397662Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5398262Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5398869Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5398999Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.5399039Z Autotune Choices Stats: 2025-12-04T09:58:55.5399803Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.5400053Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5400219Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5400506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5401132Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5401748Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5402370Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5403005Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5403632Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5404269Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5404907Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5405533Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5406205Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5406830Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5406959Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.5407035Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5407077Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5407116Z unimplemented [] 
2025-12-04T09:58:55.5407191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5407293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5407868Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5407924Z graph_break [] 2025-12-04T09:58:55.5407999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5408040Z Autotune Choices Stats: 2025-12-04T09:58:55.5408777Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.5408934Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5409049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5409210Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5409822Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5410418Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5411022Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5411635Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5412234Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5412838Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5413461Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5414062Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5414659Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5415260Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5415390Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.5415431Z Autotune Choices Stats: 2025-12-04T09:58:55.5416224Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5416455Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5416626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5416915Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5417560Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5418184Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5418804Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5419429Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5420069Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5420691Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5421321Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5421969Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5422588Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5423212Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5423339Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.5423413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5423457Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5423497Z unimplemented [] 2025-12-04T09:58:55.5423558Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.5423659Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5424248Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5424289Z graph_break [] 2025-12-04T09:58:55.5424361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5424402Z Autotune Choices Stats: 2025-12-04T09:58:55.5425151Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.5425287Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5425401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5425562Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5426219Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5426823Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5427424Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5428020Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5428638Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5429234Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5429845Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5430467Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5431065Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5431660Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5431790Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.5431830Z Autotune Choices Stats: 
2025-12-04T09:58:55.5432576Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.5432807Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5432973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5433259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5433886Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5434530Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5435154Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5435775Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5436431Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5437064Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5437691Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5438331Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5438992Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5439620Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5439748Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.5439823Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5439864Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5439902Z unimplemented [] 2025-12-04T09:58:55.5439963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5440063Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5440635Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5440675Z graph_break [] 2025-12-04T09:58:55.5440750Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5440789Z Autotune Choices Stats: 2025-12-04T09:58:55.5441531Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.5441668Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5441783Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5441942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.5442561Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5443173Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5443776Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5444382Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5444986Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5445597Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5446257Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5446874Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5447484Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5448079Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5448207Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.5448247Z Autotune Choices Stats: 2025-12-04T09:58:55.5449006Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5449222Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5449386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5449673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5450305Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5450949Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5451590Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5452211Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5452839Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5453462Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5454095Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5454729Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5455374Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5456048Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5456176Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.5456250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5456292Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5456330Z unimplemented [] 2025-12-04T09:58:55.5456390Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5456490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5457056Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5457093Z graph_break [] 2025-12-04T09:58:55.5457168Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5457207Z Autotune Choices Stats: 2025-12-04T09:58:55.5457962Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.5458089Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5458203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5458375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5458983Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5459613Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5460215Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5460813Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5461412Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5462017Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5462627Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5463231Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5463841Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5464451Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5464580Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.5464619Z Autotune Choices Stats: 2025-12-04T09:58:55.5465368Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5465585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5465750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5466064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5466716Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5467352Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5467987Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5468622Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5469249Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5469879Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5470502Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5471145Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5471787Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5472423Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5472560Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.5472635Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5472677Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5472714Z unimplemented [] 2025-12-04T09:58:55.5472779Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5472880Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5473450Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5473487Z graph_break [] 2025-12-04T09:58:55.5473561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5473600Z Autotune Choices Stats: 2025-12-04T09:58:55.5474340Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.5474469Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5474582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5474742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5475368Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5476017Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5476644Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5477246Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5477846Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5478449Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5479048Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5479663Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5480274Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5480889Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5481017Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.5481056Z Autotune Choices Stats: 2025-12-04T09:58:55.5481816Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.5482037Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5482201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5482483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5483112Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5483745Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5484379Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5485010Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5485646Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5486315Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5486936Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5487556Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5488199Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5488838Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5488979Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.5489052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5489095Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5489133Z unimplemented [] 2025-12-04T09:58:55.5489195Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5489307Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5489879Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5489915Z graph_break [] 2025-12-04T09:58:55.5489989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5490028Z Autotune Choices Stats: 2025-12-04T09:58:55.5490768Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.5490894Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5491006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5491169Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5491787Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5492386Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5493001Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5493620Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5494217Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5494817Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5495422Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5496058Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5496669Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5497284Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5497424Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.5497463Z Autotune Choices Stats: 2025-12-04T09:58:55.5498231Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5498448Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5498611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5498891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5499522Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5500141Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5500769Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5501400Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5502050Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5502673Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5503295Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5503932Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.5504553Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5505181Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5505321Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.5505395Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5505438Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5505474Z unimplemented [] 2025-12-04T09:58:55.5505546Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5505644Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5506254Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5506292Z graph_break [] 2025-12-04T09:58:55.5506365Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5506405Z Autotune Choices Stats: 2025-12-04T09:58:55.5507142Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.5507272Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5507384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5507544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5508158Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5508770Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5509367Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5509979Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.5510609Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5511210Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5511808Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5512408Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5513023Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5513621Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5513759Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.5513798Z Autotune Choices Stats: 2025-12-04T09:58:55.5514556Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5514793Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5514957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5515234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5515857Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5516521Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5517147Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5517780Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5518416Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5519066Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5519690Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5520314Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5520937Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5521571Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5521700Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.5521774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5521816Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5521865Z unimplemented [] 2025-12-04T09:58:55.5521925Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5522026Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5522597Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5522646Z graph_break [] 2025-12-04T09:58:55.5522719Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5522760Z Autotune Choices Stats: 2025-12-04T09:58:55.5523508Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.5523638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5523754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5523913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5524526Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5525126Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5525735Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5526378Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5526995Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5527623Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5528224Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5528826Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5529421Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5530032Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5530161Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.5530201Z Autotune Choices Stats: 2025-12-04T09:58:55.5530953Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.5531186Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5531350Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5531638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5532267Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5532894Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5533517Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5534143Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5534772Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5535412Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5536098Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5536725Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5537357Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5537981Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5538109Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.5538184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5538227Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5538266Z unimplemented [] 2025-12-04T09:58:55.5538325Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5538441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5539013Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5539063Z graph_break [] 2025-12-04T09:58:55.5539138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5539178Z Autotune Choices Stats: 2025-12-04T09:58:55.5539918Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.5540075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5540188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5540347Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5540955Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5541560Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5542160Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5542771Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5543371Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5543982Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5544600Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5545200Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5545804Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5546437Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5546565Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.5546607Z Autotune Choices Stats: 2025-12-04T09:58:55.5547379Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.5547610Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5547777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5548068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5548707Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5549328Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5549947Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5550574Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5551214Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5551839Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5552476Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5553121Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5553746Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5554373Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5554500Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.5554576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5554617Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5554655Z unimplemented [] 2025-12-04T09:58:55.5554715Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5554819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5555404Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5555442Z graph_break [] 2025-12-04T09:58:55.5555516Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5555555Z Autotune Choices Stats: 2025-12-04T09:58:55.5556313Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.5556472Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5556586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5556746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5557378Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5557981Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5558583Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5559187Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5559805Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5560405Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5561022Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5561649Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5562249Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5562850Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5562976Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.5563020Z Autotune Choices Stats: 2025-12-04T09:58:55.5563784Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.5564009Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5564176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5564459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5565091Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5565738Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5566408Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5567028Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5567659Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5568304Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5568926Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5569566Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5570220Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5570844Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5570973Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.5571048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5571090Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5571126Z unimplemented [] 2025-12-04T09:58:55.5571190Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5571291Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5571864Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5571901Z graph_break [] 
2025-12-04T09:58:55.5571975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5572014Z Autotune Choices Stats: 2025-12-04T09:58:55.5572769Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.5572914Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5573026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5573186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5573802Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5574413Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5575016Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5575617Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5576252Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5576870Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5577488Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5578083Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5578708Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5579427Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5579556Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.5579595Z Autotune Choices Stats: 2025-12-04T09:58:55.5580356Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.5580573Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5580738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5581031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5581668Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5582301Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5582951Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5583576Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5584207Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5584837Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5585470Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5586133Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5586769Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5587425Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5587553Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.5587630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5587675Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5587713Z unimplemented [] 2025-12-04T09:58:55.5587773Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5587873Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5588446Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5588483Z graph_break [] 2025-12-04T09:58:55.5588558Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.5588597Z Autotune Choices Stats: 2025-12-04T09:58:55.5589351Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.5589478Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5589591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5589759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5590373Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5591004Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5591609Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5592212Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5592821Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5593424Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5594034Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5594651Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5595258Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5595870Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5596028Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.5596067Z Autotune Choices Stats: 2025-12-04T09:58:55.5596822Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.5597040Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5597204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5597482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5598132Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5598779Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5599417Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5600056Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5600684Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5601308Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5601932Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5602567Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5603201Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5603850Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5603977Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.5604050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5604092Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5604128Z unimplemented [] 2025-12-04T09:58:55.5604194Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5604294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5604866Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5604904Z graph_break [] 2025-12-04T09:58:55.5604976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5605016Z Autotune Choices Stats: 
2025-12-04T09:58:55.5605755Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.5605884Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5606041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5606203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5606832Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5607441Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5608074Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5608676Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5609273Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5609871Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5610478Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5611091Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5611703Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5612320Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5612451Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.5612490Z Autotune Choices Stats: 2025-12-04T09:58:55.5613254Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.5613472Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5613634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5613913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5614543Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5615185Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5615819Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5616504Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5617139Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5617762Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5618383Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5619010Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5619645Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5620279Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5620419Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.5620493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5620539Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5620576Z unimplemented [] 2025-12-04T09:58:55.5620637Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5620745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5621323Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5621362Z graph_break [] 2025-12-04T09:58:55.5621435Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5621475Z Autotune Choices Stats: 2025-12-04T09:58:55.5622216Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.5622344Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5622458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5622619Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5623237Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5623841Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5624453Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5625081Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5625679Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.5626314Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5626914Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5627532Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5628130Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5628743Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5628884Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.5628927Z Autotune Choices Stats: 2025-12-04T09:58:55.5629708Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.5629925Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.5630090Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5630371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5631000Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5631625Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5632260Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5632897Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5633540Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5634164Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5634787Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5635415Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5636087Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5636711Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5636853Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.5636938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5636981Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5637032Z unimplemented [] 2025-12-04T09:58:55.5637093Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5637194Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5637780Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5637821Z graph_break [] 2025-12-04T09:58:55.5637896Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5637938Z Autotune Choices Stats: 2025-12-04T09:58:55.5638668Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.5638795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5638908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5639070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5639679Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5640292Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5640899Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5641515Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5642134Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5642734Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5643336Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5643945Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5644552Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5645155Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5645292Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.5645332Z Autotune Choices Stats: 2025-12-04T09:58:55.5646132Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.5646385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5646550Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5646827Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5647461Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5648090Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5648725Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5649348Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5649989Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5650637Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5651255Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5651883Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5652515Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5653145Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5653273Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.5653350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5653403Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5653441Z unimplemented [] 2025-12-04T09:58:55.5653501Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5653602Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5654175Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5654226Z graph_break [] 2025-12-04T09:58:55.5654300Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5654341Z Autotune Choices Stats: 2025-12-04T09:58:55.5655096Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.5655223Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5655336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5655495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5656140Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5656741Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5657362Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5657957Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5658573Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5659204Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5659812Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5660410Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5661012Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5661629Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5661758Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.5661798Z Autotune Choices Stats: 2025-12-04T09:58:55.5662571Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.5662796Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5662959Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5663243Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5663881Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5664505Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5665138Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5665774Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5666438Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5667078Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5667724Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5668359Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5668982Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5669607Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5669735Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.5669811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5669853Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5669891Z unimplemented [] 2025-12-04T09:58:55.5669965Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5670065Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5670634Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5670681Z graph_break [] 2025-12-04T09:58:55.5670756Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5670796Z Autotune Choices Stats: 2025-12-04T09:58:55.5671547Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.5671683Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5671797Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5671960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5672581Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5673181Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5673786Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5674397Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5675002Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5675617Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5676280Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5676887Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5677487Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5678089Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5678221Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.5678260Z Autotune Choices Stats: 2025-12-04T09:58:55.5679035Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.5679264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5679430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5679715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5680355Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5680980Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5681609Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5682231Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5682868Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5683495Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5684130Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5684772Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5685396Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5686064Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5686192Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.5686265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5686309Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5686346Z unimplemented [] 2025-12-04T09:58:55.5686407Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5686507Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5687091Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5687128Z graph_break [] 2025-12-04T09:58:55.5687202Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5687255Z Autotune Choices Stats: 2025-12-04T09:58:55.5687990Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.5688131Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5688243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5688417Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5689027Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5689630Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5690231Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5690833Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5691450Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5692055Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5692671Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5693293Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5693890Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5694497Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5694629Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.5694670Z Autotune Choices Stats: 2025-12-04T09:58:55.5695438Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.5695655Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5695819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5696141Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5696772Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5697433Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5698057Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5698679Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5699308Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5699954Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5700590Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5701233Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5701873Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5702496Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5702624Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.5702698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5702742Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5702779Z unimplemented [] 2025-12-04T09:58:55.5702840Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5702939Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5703511Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5703549Z graph_break [] 2025-12-04T09:58:55.5703621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5703662Z Autotune Choices Stats: 2025-12-04T09:58:55.5704416Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.5704552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5704665Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5704837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5705467Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5706103Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5706710Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5707405Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5708074Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5708698Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5709318Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5709963Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5710574Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5711176Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5711307Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.5711348Z Autotune Choices Stats: 2025-12-04T09:58:55.5712105Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.5712322Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5712501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5712778Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5713423Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5714059Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5714721Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5715354Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5716097Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5716727Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5717371Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5718010Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5718659Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5719307Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5719438Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.5719514Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5719558Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5719596Z unimplemented [] 2025-12-04T09:58:55.5719658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5719759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5720331Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5720371Z graph_break [] 2025-12-04T09:58:55.5720447Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5720490Z Autotune Choices Stats: 2025-12-04T09:58:55.5721241Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.5721367Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5721483Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5721658Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5722273Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5722892Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5723492Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5724093Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5724696Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5725295Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5725909Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5726583Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5727214Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5727816Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5727944Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.5727985Z Autotune Choices Stats: 2025-12-04T09:58:55.5728756Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.5728972Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5729139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5729415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5730060Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5730702Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5731347Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5731968Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5732605Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5733235Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5733858Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5734502Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5735140Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5735783Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5735911Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.5736025Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5736069Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5736109Z unimplemented [] 2025-12-04T09:58:55.5736170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5736271Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5736849Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5736887Z graph_break [] 2025-12-04T09:58:55.5736961Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5737002Z Autotune Choices Stats: 2025-12-04T09:58:55.5737745Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.5737871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5737985Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5738165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5738779Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5739390Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5740019Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5740623Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5741226Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5741828Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5742453Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5743056Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5743673Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5744294Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5744422Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.5744463Z Autotune Choices Stats: 2025-12-04T09:58:55.5745230Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5745446Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5745614Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5745890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.5746553Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5747203Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5747843Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5748491Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5749120Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5749741Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5750371Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5751010Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5751633Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5752275Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5752415Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.5752489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5752532Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5752568Z unimplemented [] 2025-12-04T09:58:55.5752641Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5752740Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5753312Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5753349Z graph_break [] 2025-12-04T09:58:55.5753422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5753464Z Autotune Choices Stats: 2025-12-04T09:58:55.5754212Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.5754340Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5754452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5754615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5755229Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5755834Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5756494Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5757136Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5757733Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5758340Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5758950Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5759568Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5760170Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5760785Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5760924Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.5760964Z Autotune Choices Stats: 2025-12-04T09:58:55.5761738Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.5761953Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5762119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5762397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5763032Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5763677Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5764298Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5764932Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5765583Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5766254Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5766880Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5767500Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5768150Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5768774Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5768914Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.5768987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5769044Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5769080Z unimplemented [] 2025-12-04T09:58:55.5769142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5769240Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5769831Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5769870Z graph_break [] 2025-12-04T09:58:55.5769949Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5769990Z Autotune Choices Stats: 2025-12-04T09:58:55.5770730Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.5770860Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5770971Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5771135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5771739Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5772352Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5772953Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5773564Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5774193Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5774798Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5775398Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5776029Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5776655Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5777255Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5777396Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.5777436Z Autotune Choices Stats: 2025-12-04T09:58:55.5778219Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.5778436Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5778601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5778881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5779511Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5780137Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5780773Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5781399Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5782036Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5782684Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5783314Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5783942Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5784565Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5785197Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5785329Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.5785414Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5785459Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5785498Z unimplemented [] 2025-12-04T09:58:55.5785558Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5785658Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5786273Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5786325Z graph_break [] 2025-12-04T09:58:55.5786402Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5786443Z Autotune Choices Stats: 2025-12-04T09:58:55.5787194Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.5787321Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5787434Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5787596Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5788210Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5788808Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5789421Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5790050Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5790664Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5791274Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5791875Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5792482Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5793082Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5793692Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5793822Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.5793873Z Autotune Choices Stats: 2025-12-04T09:58:55.5794632Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.5794859Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5795033Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5795313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5795976Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5796604Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5797229Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5797868Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5798494Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5799134Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5799781Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5800408Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5801037Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5801658Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5801787Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.5801863Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.5801917Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5801957Z unimplemented [] 2025-12-04T09:58:55.5802018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5802120Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5802697Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5802745Z graph_break [] 2025-12-04T09:58:55.5802818Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5802873Z Autotune Choices Stats: 2025-12-04T09:58:55.5803621Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.5803750Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5803865Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5804028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5804638Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5805243Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5805846Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.5806510Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5807125Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5807741Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5808360Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5808960Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5809563Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5810170Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5810299Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.5810342Z Autotune Choices Stats: 2025-12-04T09:58:55.5811117Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.5811343Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5811506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5811798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5812450Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5813077Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5813699Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5814328Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5814965Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5815602Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5816268Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5816917Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5817543Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5818170Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5818299Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.5818375Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5818418Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.5818455Z unimplemented [] 2025-12-04T09:58:55.5818515Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5818617Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5819204Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5819257Z graph_break [] 2025-12-04T09:58:55.5819332Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5819372Z Autotune Choices Stats: 2025-12-04T09:58:55.5820115Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.5820253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5820371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5820546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5821162Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5821762Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5822367Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5822967Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5823587Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5824198Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5824821Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5825427Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5826067Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5826668Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5826796Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.5826840Z Autotune Choices Stats: 2025-12-04T09:58:55.5827610Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.5827828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5828010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5828283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5828930Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5829567Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5830192Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5830818Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5831444Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5832085Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5832717Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5833363Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5833989Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5834615Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5834743Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.5834819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5834862Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5834902Z unimplemented [] 
2025-12-04T09:58:55.5834961Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5835060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5835627Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5835664Z graph_break [] 2025-12-04T09:58:55.5835749Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5835790Z Autotune Choices Stats: 2025-12-04T09:58:55.5836567Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.5836711Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5836828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5837003Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5837625Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5838223Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5838832Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5839431Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5840047Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5840657Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5841280Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5841899Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5842508Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5843110Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5843239Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.5843287Z Autotune Choices Stats: 2025-12-04T09:58:55.5844050Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.5844265Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5844443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5844720Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5845364Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5846057Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5846680Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5847306Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5847937Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5848564Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5849198Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5849842Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5850497Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5851122Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5851248Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.5851325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5851371Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5851409Z unimplemented [] 2025-12-04T09:58:55.5851470Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.5851571Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5852142Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5852179Z graph_break [] 2025-12-04T09:58:55.5852257Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5852297Z Autotune Choices Stats: 2025-12-04T09:58:55.5853051Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.5853177Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5853310Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5853473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5854081Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5854709Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5855315Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5855916Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5856567Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5857191Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5857793Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5858407Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5859046Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5859643Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5859772Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.5859811Z Autotune Choices Stats: 
2025-12-04T09:58:55.5860567Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.5860782Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5860950Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5861226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5861868Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5862503Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5863148Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5863772Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5864399Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5865024Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5865660Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5866330Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5866978Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5867625Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5868888Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.5868970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5869014Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5869051Z unimplemented [] 2025-12-04T09:58:55.5869114Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5869216Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5869790Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5869835Z graph_break [] 2025-12-04T09:58:55.5869918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5869959Z Autotune Choices Stats: 2025-12-04T09:58:55.5870706Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.5870832Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5870969Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5871131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.5871744Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5872348Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5872975Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5873622Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5874229Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5874839Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5875454Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5876102Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5876711Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5877344Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5877497Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.5877541Z Autotune Choices Stats: 2025-12-04T09:58:55.5878310Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.5878525Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5878694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5878973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5879619Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5880250Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5880872Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5881515Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5882152Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5882786Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5883412Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5884049Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5884679Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5885301Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5885442Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.5885531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5885576Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5885612Z unimplemented [] 2025-12-04T09:58:55.5885675Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5885773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5886385Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5886422Z graph_break [] 2025-12-04T09:58:55.5886498Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5886540Z Autotune Choices Stats: 2025-12-04T09:58:55.5887286Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.5887416Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5887530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5887693Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5888416Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5889021Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5889624Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5890254Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5890868Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5891475Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5892076Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5892682Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5893282Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5893887Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5894026Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.5894065Z Autotune Choices Stats: 2025-12-04T09:58:55.5894826Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.5895054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5895222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5895501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5896167Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.5896804Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5897431Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5900697Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5901376Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5902018Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5902642Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5903277Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5903911Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5904536Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5904672Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.5904771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5904817Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5904855Z unimplemented [] 2025-12-04T09:58:55.5904919Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5905022Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.5905618Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5905667Z graph_break [] 2025-12-04T09:58:55.5905745Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5905787Z Autotune Choices Stats: 2025-12-04T09:58:55.5906574Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.5906704Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5906823Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5906987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5907602Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5908216Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5908818Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5909434Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5910048Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5910663Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5911268Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5911874Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5912478Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5913082Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5913214Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.5913265Z Autotune Choices Stats: 2025-12-04T09:58:55.5914035Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.5914254Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5914432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5914711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5915341Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.5915996Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5916637Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5917261Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5917889Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5918545Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5919181Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5919810Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5920439Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5921081Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5921212Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.5921288Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5921332Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5921368Z unimplemented [] 2025-12-04T09:58:55.5921435Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5921534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5922123Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5922162Z graph_break [] 2025-12-04T09:58:55.5922246Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5922287Z Autotune Choices Stats: 2025-12-04T09:58:55.5923037Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.5923176Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5923291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5923452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5924065Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5924662Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5925289Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5925894Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5926562Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5927159Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5927778Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5928378Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5928978Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5929596Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5929728Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.5929768Z Autotune Choices Stats: 2025-12-04T09:58:55.5930519Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5930750Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5930930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5931207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5931856Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5932482Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5933107Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5933741Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5934372Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5935008Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5935640Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5936324Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5936949Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5937577Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5937719Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.5937796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5937840Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5937877Z unimplemented [] 2025-12-04T09:58:55.5937939Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5938039Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5938620Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5938671Z graph_break [] 2025-12-04T09:58:55.5938746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5938787Z Autotune Choices Stats: 2025-12-04T09:58:55.5939540Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2030", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.5939681Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5939798Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5939959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5940578Z triton_flex_attention_2030 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5941185Z triton_flex_attention_2031 0.0108 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5941786Z triton_flex_attention_2026 0.0112 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5942393Z triton_flex_attention_2028 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5943001Z triton_flex_attention_2029 0.0116 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5943626Z triton_flex_attention_2046 0.0132 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5944227Z triton_flex_attention_2027 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5944844Z triton_flex_attention_2038 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5945452Z triton_flex_attention_2044 0.0144 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5946082Z triton_flex_attention_2024 0.0147 ms 63.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5946225Z SingleProcess AUTOTUNE benchmarking takes 0.1936 seconds and 0.4021 seconds precompiling for 24 choices 2025-12-04T09:58:55.5946267Z Autotune Choices Stats: 2025-12-04T09:58:55.5947025Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2065", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.5947244Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5947426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5947704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5948347Z triton_flex_attention_backward_2065 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5948981Z triton_flex_attention_backward_2059 0.0182 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5949606Z triton_flex_attention_backward_2056 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5950232Z triton_flex_attention_backward_2057 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5950870Z triton_flex_attention_backward_2066 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5951497Z triton_flex_attention_backward_2067 0.0200 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5952145Z triton_flex_attention_backward_2064 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5952773Z triton_flex_attention_backward_2069 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5953409Z triton_flex_attention_backward_2060 0.0224 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5954035Z triton_flex_attention_backward_2051 0.0230 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5954163Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.8209 seconds precompiling for 22 choices 2025-12-04T09:58:55.5954238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5954280Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5954317Z unimplemented [] 2025-12-04T09:58:55.5954377Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5954479Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5955062Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.5955103Z graph_break [] 2025-12-04T09:58:55.5955176Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5955218Z Autotune Choices Stats: 2025-12-04T09:58:55.5955988Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2077", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010320000350475311, "best_triton_pos": 0} 2025-12-04T09:58:55.5956130Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5956262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5956421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5957041Z triton_flex_attention_2077 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5957640Z triton_flex_attention_2074 0.0118 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5958238Z triton_flex_attention_2076 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5958851Z triton_flex_attention_2073 0.0130 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5959455Z triton_flex_attention_2084 0.0136 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5960058Z triton_flex_attention_2092 0.0139 ms 74.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5960685Z triton_flex_attention_2090 0.0144 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5961284Z triton_flex_attention_2082 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5961895Z triton_flex_attention_2075 0.0154 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5962497Z triton_flex_attention_2088 0.0165 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5962624Z SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 0.3908 seconds precompiling for 24 choices 2025-12-04T09:58:55.5962665Z Autotune Choices Stats: 2025-12-04T09:58:55.5963434Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2111", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5963650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5963815Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5964091Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5964742Z triton_flex_attention_backward_2111 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5965368Z triton_flex_attention_backward_2105 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5966037Z triton_flex_attention_backward_2110 0.0181 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5966664Z triton_flex_attention_backward_2102 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5967288Z triton_flex_attention_backward_2103 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5967926Z triton_flex_attention_backward_2113 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5968549Z triton_flex_attention_backward_2112 0.0204 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5969204Z triton_flex_attention_backward_2115 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.5969832Z triton_flex_attention_backward_2097 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5970463Z triton_flex_attention_backward_2106 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5970592Z SingleProcess AUTOTUNE benchmarking takes 0.4709 seconds and 0.7187 seconds precompiling for 22 choices 2025-12-04T09:58:55.5970686Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.5970733Z Traceback (most recent call last): 2025-12-04T09:58:55.5970889Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.5970929Z self.assertTrue( 2025-12-04T09:58:55.5971036Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.5971087Z raise self.failureException(msg) 2025-12-04T09:58:55.5971216Z AssertionError: False is not true : Log file /tmp/tmpy5ckrs5_/flex_attention_configs.json was not created 
2025-12-04T09:58:55.5971220Z 2025-12-04T09:58:55.5971294Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.5971463Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.5971476Z 2025-12-04T09:58:55.5971569Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.5971645Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5971691Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5971728Z unimplemented [] 2025-12-04T09:58:55.5971790Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5972368Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.5972483Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5972519Z graph_break [] 2025-12-04T09:58:55.5972593Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.5973089Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.5973139Z current_size = base.storage().size() 2025-12-04T09:58:55.5973181Z Autotune Choices Stats: 2025-12-04T09:58:55.5973923Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.5974063Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5974178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5974340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5974945Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5975543Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5976193Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5976798Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5977427Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5978025Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5978643Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5979240Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5979836Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5980445Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5980576Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.5980616Z Autotune Choices Stats: 2025-12-04T09:58:55.5981376Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.5981616Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5981780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5982072Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5982704Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5983327Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.5983951Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5984591Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5985221Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5985865Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5986505Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5987142Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5987761Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5988382Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5988535Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.5988611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.5988654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.5988691Z unimplemented [] 2025-12-04T09:58:55.5988753Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.5988852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.5989430Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.5989481Z graph_break [] 2025-12-04T09:58:55.5989555Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.5989597Z Autotune Choices Stats: 2025-12-04T09:58:55.5990346Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.5990488Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5990603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5990764Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5991375Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5991979Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5992589Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5993188Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5993789Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5994408Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.5995008Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5995619Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5996279Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5996881Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5997026Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.5997066Z Autotune Choices Stats: 2025-12-04T09:58:55.5997825Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.5998043Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.5998220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.5998511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.5999139Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.5999781Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6000401Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6001024Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6001658Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6002279Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6002919Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6003544Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6004179Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6004801Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6004932Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.6005006Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6005050Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6005086Z unimplemented [] 2025-12-04T09:58:55.6005147Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6005247Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6005828Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6005865Z graph_break [] 2025-12-04T09:58:55.6005975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6006015Z Autotune Choices Stats: 2025-12-04T09:58:55.6006753Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.6006901Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6007030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6007190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6007813Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6008408Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6009002Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6009616Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6010221Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.6010820Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6011447Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6012046Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6012659Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6013260Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6013387Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.6013428Z Autotune Choices Stats: 2025-12-04T09:58:55.6014196Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.6014416Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.6014579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6014856Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6015513Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6016176Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6016804Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6017426Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6018052Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6018692Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6019314Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6019967Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6020594Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6021226Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6021355Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.6021431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6021473Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6021510Z unimplemented [] 2025-12-04T09:58:55.6021572Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6021673Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6022246Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6022285Z graph_break [] 2025-12-04T09:58:55.6022369Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6022411Z Autotune Choices Stats: 2025-12-04T09:58:55.6023152Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.6023281Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6023406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6023566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6024191Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6024803Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6025406Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6026042Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6026661Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6027267Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6027866Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6028490Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6029090Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6029708Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6029837Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.6029878Z Autotune Choices Stats: 2025-12-04T09:58:55.6030636Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.6030864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6031030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6031305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6031947Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6032600Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6033222Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6033855Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6034483Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6035120Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6035741Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6036406Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6037059Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6037684Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6037824Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.6037899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6037942Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6037980Z unimplemented [] 2025-12-04T09:58:55.6038040Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6038138Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6038724Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6038761Z graph_break [] 2025-12-04T09:58:55.6038835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6038879Z Autotune Choices Stats: 2025-12-04T09:58:55.6039626Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.6039753Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6039867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6040032Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6040646Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6041274Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.6041884Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6042482Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6043089Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6043699Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6044302Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6044904Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6045525Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6046161Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6046289Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.6046332Z Autotune Choices Stats: 2025-12-04T09:58:55.6047079Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.6047300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6047465Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6047758Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6048392Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6049016Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6049669Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6050291Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6050924Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6051553Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6052189Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6052811Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6053434Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6054080Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6054219Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.6054295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6054337Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6054376Z unimplemented [] 2025-12-04T09:58:55.6054436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6054536Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6055110Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6055148Z graph_break [] 2025-12-04T09:58:55.6055223Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6055263Z Autotune Choices Stats: 2025-12-04T09:58:55.6056036Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.6056163Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6056295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6056455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6057064Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6057669Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6058299Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6058912Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6059516Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6060119Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6060739Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6061338Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6061939Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6062557Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6062702Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.6062743Z Autotune Choices Stats: 2025-12-04T09:58:55.6063503Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.6063717Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6063882Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6064157Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6064804Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6065424Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6066065Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6066715Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6067361Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6067992Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6068616Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6069255Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6069882Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6070501Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6070641Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.6070725Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6070770Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6070808Z unimplemented [] 2025-12-04T09:58:55.6070869Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6070967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6071557Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6071593Z graph_break [] 2025-12-04T09:58:55.6071668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6071707Z Autotune Choices Stats: 2025-12-04T09:58:55.6072444Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.6072573Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6072687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6072847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6073464Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6074066Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6074678Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6075288Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6075901Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6076538Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6077139Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6077752Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6078358Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6078966Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6079107Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.6079165Z Autotune Choices Stats: 2025-12-04T09:58:55.6079922Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.6080152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6080321Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6080599Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6081222Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6081856Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6082483Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6083110Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6083754Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6084393Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6085020Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6085650Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6086341Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6086969Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6087101Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.6087190Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6087236Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6087272Z unimplemented [] 2025-12-04T09:58:55.6087333Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6087433Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6088025Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6088075Z graph_break [] 2025-12-04T09:58:55.6088150Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6088189Z Autotune Choices Stats: 2025-12-04T09:58:55.6088927Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.6089054Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6089167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6089336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6089940Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6090554Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6091158Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6091786Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6092398Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6093012Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6093614Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6094226Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6094838Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6095440Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6095577Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.6095634Z Autotune Choices Stats: 2025-12-04T09:58:55.6096461Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.6096682Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6096860Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6097138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.6097769Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6098403Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6099035Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6099660Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6100287Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6100941Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6101576Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6102208Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6103535Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6104834Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6105628Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.6105867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6106058Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6106167Z unimplemented [] 2025-12-04T09:58:55.6106285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6106478Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6107211Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6107852Z graph_break [] 2025-12-04T09:58:55.6108001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6108151Z Autotune Choices Stats: 2025-12-04T09:58:55.6108954Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.6109875Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6110150Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6110457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6111264Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6112525Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6113787Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6115024Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6116337Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6117579Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6118825Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6120057Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6121288Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6122550Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6123318Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.6123522Z Autotune Choices Stats: 2025-12-04T09:58:55.6124338Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.6125367Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6125797Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6126307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6127264Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6128542Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6129823Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6131114Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6132404Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6133703Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6135090Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6136454Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6137739Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6139024Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6139807Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.6140077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6140236Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6140345Z unimplemented [] 2025-12-04T09:58:55.6140463Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6140655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6141365Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6142019Z graph_break [] 2025-12-04T09:58:55.6142146Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6142297Z Autotune Choices Stats: 2025-12-04T09:58:55.6143116Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.6144001Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.6144289Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6144597Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6145400Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6146683Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6147918Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6149172Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6150406Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6151672Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6152911Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6154158Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6155395Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6156671Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6157450Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.6157654Z Autotune Choices Stats: 2025-12-04T09:58:55.6158487Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.6159485Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6159912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6160387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6161335Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6162632Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6163912Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6165197Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6166542Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6167832Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6169165Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6170453Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6171747Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6173017Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6173809Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.6174048Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.6174200Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6174308Z unimplemented [] 2025-12-04T09:58:55.6174426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6174619Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6175336Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6176025Z graph_break [] 2025-12-04T09:58:55.6176151Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6176303Z Autotune Choices Stats: 2025-12-04T09:58:55.6177102Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.6178017Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6178302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6178611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6179411Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6180674Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6181912Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.6183163Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6184402Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6185645Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6186953Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6188195Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6189444Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6190680Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6191447Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.6191651Z Autotune Choices Stats: 2025-12-04T09:58:55.6192476Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.6193487Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6193900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6194369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6195328Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6196650Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6197955Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6199228Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6200512Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6201810Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6203094Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6204402Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6205694Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6207019Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6207807Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.6208048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6208201Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.6208311Z unimplemented [] 2025-12-04T09:58:55.6208429Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6208621Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6209328Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6209971Z graph_break [] 2025-12-04T09:58:55.6210113Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6210264Z Autotune Choices Stats: 2025-12-04T09:58:55.6211059Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.6211959Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6212250Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6212556Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6213381Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6214623Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6215857Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6217107Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6218361Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6219599Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6220839Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6222107Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6223349Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6224602Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6225374Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.6225579Z Autotune Choices Stats: 2025-12-04T09:58:55.6226426Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.6227433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6227866Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6228337Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6229272Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6230586Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6231870Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6233164Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6234446Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6235749Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6237058Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6238336Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6239661Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6240945Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6241755Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.6241995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6242149Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6242256Z unimplemented [] 
2025-12-04T09:58:55.6242371Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6242562Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6243267Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6243909Z graph_break [] 2025-12-04T09:58:55.6244036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6244188Z Autotune Choices Stats: 2025-12-04T09:58:55.6245009Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.6245906Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6246225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6246533Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6247334Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6248589Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6249823Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6251064Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6252300Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6253555Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6254789Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6256071Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6257340Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6258566Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6259347Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.6259548Z Autotune Choices Stats: 2025-12-04T09:58:55.6260365Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.6261372Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6261784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6262274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6263213Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6264501Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6265800Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6267114Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6268415Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6269704Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6270999Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6272286Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6273580Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6274888Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6275687Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.6275959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6276111Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6276220Z unimplemented [] 2025-12-04T09:58:55.6276334Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.6276528Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6277232Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6277871Z graph_break [] 2025-12-04T09:58:55.6278001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6278155Z Autotune Choices Stats: 2025-12-04T09:58:55.6278961Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.6279855Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6280145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6280451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6281260Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6282495Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6283752Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6285009Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6286277Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6287514Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6288769Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6290009Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6291247Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6292506Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6293282Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.6293487Z Autotune Choices Stats: 
2025-12-04T09:58:55.6294301Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.6295304Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6295718Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6296228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6297192Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6298485Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6299765Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6301072Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6302367Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6303655Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6304937Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6306260Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6307548Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6308827Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6309632Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.6309884Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6310038Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6310145Z unimplemented [] 2025-12-04T09:58:55.6310262Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6310458Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6311177Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6311816Z graph_break [] 2025-12-04T09:58:55.6311944Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6312094Z Autotune Choices Stats: 2025-12-04T09:58:55.6312889Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.6313780Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6314052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6314357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.6315174Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6316471Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6317715Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6318988Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6320246Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6321485Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6322723Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6323982Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6325219Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6326503Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6327282Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.6327486Z Autotune Choices Stats: 2025-12-04T09:58:55.6328317Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.6329330Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6329741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6330215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6331157Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6332449Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6333739Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6335013Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6336358Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6337665Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6338952Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6340243Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6341540Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6342831Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6343619Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.6343870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6344023Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6344129Z unimplemented [] 2025-12-04T09:58:55.6344246Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6344440Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6345156Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6345808Z graph_break [] 2025-12-04T09:58:55.6345981Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6346134Z Autotune Choices Stats: 2025-12-04T09:58:55.6346944Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.6347844Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6348123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6348435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6349237Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6350490Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6351728Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6352997Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6354251Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6355500Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6356783Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6358023Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6359274Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6360517Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6361295Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.6361512Z Autotune Choices Stats: 2025-12-04T09:58:55.6362340Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.6363347Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6363778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6364250Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6365200Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6366524Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6367818Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6369102Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6370387Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6371701Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6372995Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6374284Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6375577Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6376903Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6377687Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.6377924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6378078Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6378184Z unimplemented [] 2025-12-04T09:58:55.6378298Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6378490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6379201Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6379849Z graph_break [] 2025-12-04T09:58:55.6379975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6380136Z Autotune Choices Stats: 2025-12-04T09:58:55.6380931Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.6381838Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6382113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6382426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6383238Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6387034Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6388305Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6389553Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6390820Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6392075Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6393327Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6394565Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6395795Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6397086Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6397852Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.6398056Z Autotune Choices Stats: 2025-12-04T09:58:55.6398883Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.6399895Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6400340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6400814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6401775Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6403049Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6404323Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6405621Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6406935Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6408236Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6409544Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6410845Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6412126Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6413404Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6414197Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.6414453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6414609Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6414718Z unimplemented [] 2025-12-04T09:58:55.6414836Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6415030Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6415742Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6416423Z graph_break [] 2025-12-04T09:58:55.6416576Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6416726Z Autotune Choices Stats: 2025-12-04T09:58:55.6417538Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.6418429Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6418726Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6419036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6419837Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6421080Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6422314Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6423565Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6424802Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6426109Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6427350Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6428611Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6429846Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6431083Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6431847Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.6432060Z Autotune Choices Stats: 2025-12-04T09:58:55.6432884Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.6433893Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6434321Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6434793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6435736Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6437063Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6438348Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6439615Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6440914Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6442205Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6443501Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6444795Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.6446135Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6447416Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6448210Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.6448446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6448600Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6448706Z unimplemented [] 2025-12-04T09:58:55.6448822Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6449014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6449740Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6450386Z graph_break [] 2025-12-04T09:58:55.6450513Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6450663Z Autotune Choices Stats: 2025-12-04T09:58:55.6451467Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.6452377Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6452669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6452975Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6453774Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6455032Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6456299Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6457544Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.6458787Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6460028Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6461292Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6462525Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6463769Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6465005Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6465771Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.6466009Z Autotune Choices Stats: 2025-12-04T09:58:55.6466839Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.6467847Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6468261Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6468732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6469685Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6470981Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6472270Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6473549Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6474831Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6476179Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6477458Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6478753Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6480038Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6481329Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6482228Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.6482467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6482622Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6482728Z unimplemented [] 2025-12-04T09:58:55.6482844Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6483038Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6483751Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6484387Z graph_break [] 2025-12-04T09:58:55.6484529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6484681Z Autotune Choices Stats: 2025-12-04T09:58:55.6485476Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.6486403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6486678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6487004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6487822Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6489053Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6490300Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6491538Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6492782Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6494026Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6495262Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6496557Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6497788Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6499041Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6499806Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.6500009Z Autotune Choices Stats: 2025-12-04T09:58:55.6500831Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.6501832Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6502257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6502731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6503667Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6504971Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6506284Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6507576Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6508860Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6510151Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6511437Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6512719Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6514031Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6515313Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6516143Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.6516380Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6516533Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6516639Z unimplemented [] 2025-12-04T09:58:55.6516756Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6516947Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6517650Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6518282Z graph_break [] 2025-12-04T09:58:55.6518409Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6518557Z Autotune Choices Stats: 2025-12-04T09:58:55.6519388Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.6520273Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6520546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6520854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6521662Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6522943Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6524212Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6525459Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6526735Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6527360Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6527964Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6528564Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6529192Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6529797Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6529945Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.6529985Z Autotune Choices Stats: 2025-12-04T09:58:55.6530744Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.6530961Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6531130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6531415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6532060Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6532688Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6533339Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6533969Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6534611Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6535241Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6535882Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6536545Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6537174Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6537843Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6537981Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.6538081Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6538132Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6538173Z unimplemented [] 2025-12-04T09:58:55.6538243Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6538346Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6538938Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6538979Z graph_break [] 2025-12-04T09:58:55.6539055Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6539095Z Autotune Choices Stats: 2025-12-04T09:58:55.6539839Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.6539971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6540096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6540263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6540870Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6541472Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6542090Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6542700Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6543303Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6543905Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6544514Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6545116Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6545716Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6546374Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6546503Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.6546556Z Autotune Choices Stats: 2025-12-04T09:58:55.6547311Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.6547529Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6547698Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6547976Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6548620Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6549248Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6549878Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6550522Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6551146Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6551786Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6552413Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6553048Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6553674Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6554303Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6554443Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.6554517Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6554576Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6554615Z unimplemented [] 2025-12-04T09:58:55.6554679Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6554779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6555365Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6555404Z graph_break [] 
2025-12-04T09:58:55.6555477Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6555519Z Autotune Choices Stats: 2025-12-04T09:58:55.6556298Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.6556427Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6556541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6556704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6557335Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6557940Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6558544Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6559173Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6559787Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6560394Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6561007Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6561618Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6562218Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6562819Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6562957Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.6562998Z Autotune Choices Stats: 2025-12-04T09:58:55.6563773Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.6564000Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6564165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6564443Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6565077Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6565715Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.6566383Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6567012Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6567674Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6568313Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6568939Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6569566Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6570213Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6570841Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6570969Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.6571063Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6571106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6571142Z unimplemented [] 2025-12-04T09:58:55.6571203Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6571302Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6571889Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6571939Z graph_break [] 2025-12-04T09:58:55.6572015Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.6572056Z Autotune Choices Stats: 2025-12-04T09:58:55.6572794Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.6572921Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6573037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6573198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6573802Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6574423Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6575028Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6575633Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6576284Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6576904Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6577505Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6578109Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6578722Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6579328Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6579458Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.6579511Z Autotune Choices Stats: 2025-12-04T09:58:55.6580282Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.6580498Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6580674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6580956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6581589Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6582215Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6582850Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6583479Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6584108Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6584753Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6585391Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6586046Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6586677Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6587326Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6587455Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.6587532Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6587575Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6587613Z unimplemented [] 2025-12-04T09:58:55.6587673Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6587775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6588349Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6588407Z graph_break [] 2025-12-04T09:58:55.6588479Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6588535Z Autotune Choices Stats: 
2025-12-04T09:58:55.6589269Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.6589411Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6589525Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6589687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6590302Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6590989Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6591601Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6592208Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6592829Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6593452Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6594062Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6594664Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6595261Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6595870Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6596037Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.6596081Z Autotune Choices Stats: 2025-12-04T09:58:55.6596842Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.6597078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6597257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6597532Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6598177Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6598808Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6599430Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6600078Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6600705Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6601344Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6601979Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6602630Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6603260Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6603885Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6604014Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.6604100Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6604142Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6604180Z unimplemented [] 2025-12-04T09:58:55.6604240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6604340Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6604915Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6604965Z graph_break [] 2025-12-04T09:58:55.6605040Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6605079Z Autotune Choices Stats: 2025-12-04T09:58:55.6605834Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.6605990Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6606126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6606286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6606891Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6607489Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6608092Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6608709Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6609307Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.6609937Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6610532Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6611140Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6611745Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6612352Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6612493Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.6612536Z Autotune Choices Stats: 2025-12-04T09:58:55.6613303Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.6613517Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.6613696Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6613972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6614617Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6615252Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6615877Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6616529Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6617173Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6617802Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6618456Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6619080Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6619720Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6620343Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6620473Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.6620550Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6620592Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6620629Z unimplemented [] 2025-12-04T09:58:55.6620691Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6620793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6621380Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6621417Z graph_break [] 2025-12-04T09:58:55.6621491Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6621533Z Autotune Choices Stats: 2025-12-04T09:58:55.6622276Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.6622412Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6622537Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6622700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6623315Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6623933Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6624540Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6625152Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6625759Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6626383Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6627020Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6627626Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6628241Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6628843Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6628972Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.6629014Z Autotune Choices Stats: 2025-12-04T09:58:55.6629788Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.6630008Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6630173Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6630452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6631104Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6631729Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6632367Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6632990Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6633620Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6634262Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6634886Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6635538Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6636204Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6636846Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6636978Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.6637052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6637094Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6637130Z unimplemented [] 2025-12-04T09:58:55.6637192Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6637292Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6637859Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6637896Z graph_break [] 2025-12-04T09:58:55.6637983Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6638024Z Autotune Choices Stats: 2025-12-04T09:58:55.6638771Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.6638897Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6639024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6639189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6639804Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6640422Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6641036Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6641638Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6642247Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6642848Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6643456Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6644079Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6644681Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6645296Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6645428Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.6645467Z Autotune Choices Stats: 2025-12-04T09:58:55.6646264Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.6646496Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6646663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6646937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6647575Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6648240Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6648862Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6649504Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6650135Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6650777Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6651400Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6652028Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6652674Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6653297Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6653436Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.6653511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6653555Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6653592Z unimplemented [] 2025-12-04T09:58:55.6653653Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6653751Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6654332Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6654368Z graph_break [] 2025-12-04T09:58:55.6654444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6654483Z Autotune Choices Stats: 2025-12-04T09:58:55.6655234Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.6655362Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6655475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6655633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6656279Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6656914Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.6657530Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6658133Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6658739Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6659357Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6659961Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6660565Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6661190Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6661803Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6661933Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.6661974Z Autotune Choices Stats: 2025-12-04T09:58:55.6662735Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.6662952Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6663119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6663405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6664037Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6664659Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6665304Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6665974Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6666622Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6667248Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6667902Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6668523Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6669150Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6669801Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6669941Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.6670016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6670061Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6670102Z unimplemented [] 2025-12-04T09:58:55.6670163Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6670261Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6670834Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6670873Z graph_break [] 2025-12-04T09:58:55.6670947Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6670987Z Autotune Choices Stats: 2025-12-04T09:58:55.6671717Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.6671844Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6671969Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6672131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6672741Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6673345Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6673971Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6674590Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6675196Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6675801Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6676456Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6677061Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6677662Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6678280Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6678422Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.6678461Z Autotune Choices Stats: 2025-12-04T09:58:55.6679225Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.6679442Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6679606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6679883Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6680526Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6681156Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6681781Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6682430Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6683067Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6683694Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6684319Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6684955Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6685585Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6686244Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6686390Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.6686477Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6686520Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6686556Z unimplemented [] 2025-12-04T09:58:55.6686617Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6686729Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6687303Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6687340Z graph_break [] 2025-12-04T09:58:55.6687414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6687456Z Autotune Choices Stats: 2025-12-04T09:58:55.6688194Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.6688323Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6688438Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6688596Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6689226Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6689836Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6690452Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6691064Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6691675Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6692276Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6692880Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6693495Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6694099Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6694703Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6694850Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.6694890Z Autotune Choices Stats: 2025-12-04T09:58:55.6695640Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.6695866Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6696072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6696353Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6696988Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6697628Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6698254Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6698880Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6699538Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6700171Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6700793Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6701425Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6702063Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6702691Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6702831Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.6702906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6702948Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6702985Z unimplemented [] 2025-12-04T09:58:55.6703046Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6703150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6703731Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6703780Z graph_break [] 2025-12-04T09:58:55.6703853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6703893Z Autotune Choices Stats: 2025-12-04T09:58:55.6704637Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.6704764Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6704879Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6705037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6705655Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6706299Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6706901Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6707534Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6708134Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6708750Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6709350Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6709956Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6710566Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6711167Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6711307Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.6711348Z Autotune Choices Stats: 2025-12-04T09:58:55.6712112Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.6712339Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6712503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6712780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6713404Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6714035Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6714673Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6715299Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6716044Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6716691Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6717320Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6717949Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6718577Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6719215Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6719343Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.6719417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6719462Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6719500Z unimplemented [] 2025-12-04T09:58:55.6719560Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6719662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6720250Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6720286Z graph_break [] 2025-12-04T09:58:55.6720370Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6720410Z Autotune Choices Stats: 2025-12-04T09:58:55.6721155Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.6721292Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6721408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6721570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6722182Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6722793Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6723412Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6724013Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6724636Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6725241Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6725855Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6726487Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6727091Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6727715Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6727843Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.6727886Z Autotune Choices Stats: 2025-12-04T09:58:55.6728640Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.6728884Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6729054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6729351Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.6729995Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6730618Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6731241Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6731874Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6732510Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6733160Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6733790Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6734432Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6735063Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6735695Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6735843Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.6735920Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6735998Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6736038Z unimplemented [] 2025-12-04T09:58:55.6736108Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6736210Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6736798Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6736855Z graph_break [] 2025-12-04T09:58:55.6736939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6736984Z Autotune Choices Stats: 2025-12-04T09:58:55.6737739Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.6737881Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6737994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6738154Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6738770Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6739379Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6739994Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6740599Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6741208Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6744643Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6745250Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6745863Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6746498Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6747123Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6747254Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.6747293Z Autotune Choices Stats: 2025-12-04T09:58:55.6748054Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.6748272Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6748456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6748759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6749390Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6750029Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6750655Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6751280Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6751919Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6752542Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6753200Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6753834Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6754469Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6755110Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6755248Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.6755326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6755376Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6755415Z unimplemented [] 2025-12-04T09:58:55.6755486Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6755598Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6756219Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6756261Z graph_break [] 2025-12-04T09:58:55.6756343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6756385Z Autotune Choices Stats: 2025-12-04T09:58:55.6757133Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.6757292Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6757424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6757592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6758214Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6758812Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6759415Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6760035Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6760636Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6761238Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6761866Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6762477Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6763078Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6763683Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6763822Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.6763867Z Autotune Choices Stats: 2025-12-04T09:58:55.6764638Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.6764862Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6765028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6765330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6766021Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6766655Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6767301Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6767933Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6768577Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6769212Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6769839Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6770500Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6771142Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6771774Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6771911Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.6771984Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6772027Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6772066Z unimplemented [] 2025-12-04T09:58:55.6772128Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6772229Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6772814Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6772854Z graph_break [] 2025-12-04T09:58:55.6772928Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6772973Z Autotune Choices Stats: 2025-12-04T09:58:55.6773714Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.6773855Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6773968Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6774126Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6774754Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6775376Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6776021Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6776630Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6777257Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6777858Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6778465Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6779099Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6779727Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6780328Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6780460Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.6780500Z Autotune Choices Stats: 2025-12-04T09:58:55.6781251Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.6781486Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6781652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6781933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6782570Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6783219Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6783856Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6784484Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6785118Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6785752Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6786410Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6787038Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6787693Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6788328Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6788457Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.6788537Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.6788580Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6788618Z unimplemented [] 2025-12-04T09:58:55.6788680Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6788783Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6789349Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6789386Z graph_break [] 2025-12-04T09:58:55.6789459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6789501Z Autotune Choices Stats: 2025-12-04T09:58:55.6790253Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.6790383Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6790498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6790658Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6791276Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6791884Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6792497Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.6793113Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6793717Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6794328Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6794931Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6795534Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6796205Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6796830Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6796960Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.6797002Z Autotune Choices Stats: 2025-12-04T09:58:55.6797768Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.6797987Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6798152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6798442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6799074Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6799700Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6800346Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6800982Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6801613Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6802242Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6802873Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6803504Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6804136Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6804781Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6804919Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.6804995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6805037Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.6805076Z unimplemented [] 2025-12-04T09:58:55.6805136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6805238Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6805811Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6805850Z graph_break [] 2025-12-04T09:58:55.6805961Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6806003Z Autotune Choices Stats: 2025-12-04T09:58:55.6806751Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.6806896Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6807011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6807167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6807780Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6808420Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6809024Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6809637Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6810240Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6810846Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6811456Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6812060Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6812671Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6813292Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6813430Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.6813473Z Autotune Choices Stats: 2025-12-04T09:58:55.6814232Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.6814450Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6814614Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6814892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6815543Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6816203Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6816824Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6817494Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6818138Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6818764Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6819386Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6820039Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6820666Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6821303Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6821443Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.6821520Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6821565Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6821617Z unimplemented [] 
2025-12-04T09:58:55.6821677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6821782Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6822360Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6822401Z graph_break [] 2025-12-04T09:58:55.6822475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6822516Z Autotune Choices Stats: 2025-12-04T09:58:55.6823251Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.6823380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6823495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6823655Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6824285Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6824888Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6825512Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6826151Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6826779Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6827386Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6827991Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6828614Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6829220Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6829853Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6829980Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.6830020Z Autotune Choices Stats: 2025-12-04T09:58:55.6830785Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.6831015Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6831182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6831459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6832091Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6832724Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6833352Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6833986Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6834623Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6835260Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6835882Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6836550Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6837195Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6837825Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6837967Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.6838040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6838084Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6838122Z unimplemented [] 2025-12-04T09:58:55.6838184Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.6838296Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6838874Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6838924Z graph_break [] 2025-12-04T09:58:55.6838999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6839040Z Autotune Choices Stats: 2025-12-04T09:58:55.6839783Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.6839913Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6840025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6840185Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6840888Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6841494Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6842109Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6842732Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6843332Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6843947Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6844559Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6845158Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6845783Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6846424Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6846567Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.6846607Z Autotune Choices Stats: 
2025-12-04T09:58:55.6847381Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.6847609Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6847773Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6848052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6848690Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6849322Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6849962Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6850587Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6851237Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6851863Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6852500Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6853133Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6853760Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6854401Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6854532Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.6854606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6854654Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6854692Z unimplemented [] 2025-12-04T09:58:55.6854766Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6854866Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6855455Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6855492Z graph_break [] 2025-12-04T09:58:55.6855567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6855607Z Autotune Choices Stats: 2025-12-04T09:58:55.6856401Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.6856529Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6856642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6856806Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.6857415Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6858045Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6858651Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6859252Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6859885Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6860489Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6861109Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6861713Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6862330Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6862935Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6863064Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.6863103Z Autotune Choices Stats: 2025-12-04T09:58:55.6863859Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.6864095Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6864264Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6864553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6865179Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6865873Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6866560Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6867184Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6867809Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6868465Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6869096Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6869752Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6870380Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6871022Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6871153Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.6871226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6871270Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6871307Z unimplemented [] 2025-12-04T09:58:55.6871368Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6871468Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6872036Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6872085Z graph_break [] 2025-12-04T09:58:55.6872159Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6872199Z Autotune Choices Stats: 2025-12-04T09:58:55.6872954Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.6873091Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6873205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6873368Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6873980Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6874584Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6875204Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6875815Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6876457Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6877094Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6877710Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.6878317Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6878922Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6879545Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6879675Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.6879716Z Autotune Choices Stats: 2025-12-04T09:58:55.6880484Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.6880712Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6880874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6881163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6881794Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.6882436Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6883062Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6883701Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6884338Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6884964Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6885603Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6886278Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6886911Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6887534Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6887664Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.6887739Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6887782Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6887820Z unimplemented [] 2025-12-04T09:58:55.6887898Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6887998Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.6888572Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6888608Z graph_break [] 2025-12-04T09:58:55.6888683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6888724Z Autotune Choices Stats: 2025-12-04T09:58:55.6889468Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.6889626Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6889739Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6889912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6890519Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6891120Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6891724Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6892340Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6892947Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6893557Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6894189Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6894803Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6895408Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6896048Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6896177Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.6896220Z Autotune Choices Stats: 2025-12-04T09:58:55.6897004Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.6897222Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6897385Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6897674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6898313Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.6898944Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6899566Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6900197Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6900834Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6901462Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6902088Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6902752Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6903386Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6904011Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6904140Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.6904217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6904260Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6904301Z unimplemented [] 2025-12-04T09:58:55.6904362Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6904462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6905048Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6905088Z graph_break [] 2025-12-04T09:58:55.6905161Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6905213Z Autotune Choices Stats: 2025-12-04T09:58:55.6905988Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.6906126Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6906241Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6906401Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6907035Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6907649Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6910128Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6910742Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6911381Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6911984Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6912594Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6913210Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6913825Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6914430Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6914561Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.6914602Z Autotune Choices Stats: 2025-12-04T09:58:55.6915371Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.6915591Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6915756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6916077Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6916711Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6917363Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6918003Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6918628Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6919257Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6919899Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6920520Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6921148Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6921796Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6922434Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6922562Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.6922639Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6922683Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6922722Z unimplemented [] 2025-12-04T09:58:55.6922784Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6922886Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6923456Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6923493Z graph_break [] 2025-12-04T09:58:55.6923567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6923610Z Autotune Choices Stats: 2025-12-04T09:58:55.6924365Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2030", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.6924491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6924604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6924777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6925392Z triton_flex_attention_2030 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6926021Z triton_flex_attention_2031 0.0108 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6926639Z triton_flex_attention_2026 0.0112 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6927238Z triton_flex_attention_2028 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6927838Z triton_flex_attention_2029 0.0116 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6928454Z triton_flex_attention_2046 0.0132 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6929055Z triton_flex_attention_2027 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6929668Z triton_flex_attention_2038 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6930280Z triton_flex_attention_2044 0.0144 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6930895Z triton_flex_attention_2024 0.0147 ms 63.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6931022Z SingleProcess AUTOTUNE benchmarking takes 0.1936 seconds and 0.4021 seconds precompiling for 24 choices 2025-12-04T09:58:55.6931062Z Autotune Choices Stats: 2025-12-04T09:58:55.6931818Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2065", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.6932036Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6932211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6932489Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6933124Z triton_flex_attention_backward_2065 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6933752Z triton_flex_attention_backward_2059 0.0182 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6934384Z triton_flex_attention_backward_2056 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6935025Z triton_flex_attention_backward_2057 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6935653Z triton_flex_attention_backward_2066 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6936322Z triton_flex_attention_backward_2067 0.0200 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6936956Z triton_flex_attention_backward_2064 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6937592Z triton_flex_attention_backward_2069 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6938244Z triton_flex_attention_backward_2060 0.0224 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6938867Z triton_flex_attention_backward_2051 0.0230 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6939008Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.8209 seconds precompiling for 22 choices 2025-12-04T09:58:55.6939084Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6939126Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6939164Z unimplemented [] 2025-12-04T09:58:55.6939227Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6939327Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6939904Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.6939942Z graph_break [] 2025-12-04T09:58:55.6940017Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6940057Z Autotune Choices Stats: 2025-12-04T09:58:55.6940807Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2077", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010320000350475311, "best_triton_pos": 0} 2025-12-04T09:58:55.6940935Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6941051Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6941211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6941832Z triton_flex_attention_2077 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6942470Z triton_flex_attention_2074 0.0118 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6943072Z triton_flex_attention_2076 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6943684Z triton_flex_attention_2073 0.0130 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6944291Z triton_flex_attention_2084 0.0136 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6944897Z triton_flex_attention_2092 0.0139 ms 74.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6945508Z triton_flex_attention_2090 0.0144 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6946146Z triton_flex_attention_2082 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6946776Z triton_flex_attention_2075 0.0154 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6947380Z triton_flex_attention_2088 0.0165 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6947525Z SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 0.3908 seconds precompiling for 24 choices 2025-12-04T09:58:55.6947566Z Autotune Choices Stats: 2025-12-04T09:58:55.6948321Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2111", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.6948539Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6948706Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6948981Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6949626Z triton_flex_attention_backward_2111 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6950252Z triton_flex_attention_backward_2105 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6950898Z triton_flex_attention_backward_2110 0.0181 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6951521Z triton_flex_attention_backward_2102 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6952153Z triton_flex_attention_backward_2103 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6952787Z triton_flex_attention_backward_2113 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6953416Z triton_flex_attention_backward_2112 0.0204 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6954054Z triton_flex_attention_backward_2115 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.6954678Z triton_flex_attention_backward_2097 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6955325Z triton_flex_attention_backward_2106 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6955453Z SingleProcess AUTOTUNE benchmarking takes 0.4709 seconds and 0.7187 seconds precompiling for 22 choices 2025-12-04T09:58:55.6955528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6955584Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6955622Z unimplemented [] 2025-12-04T09:58:55.6955683Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6955782Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6956408Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6956445Z graph_break [] 2025-12-04T09:58:55.6956519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6956559Z Autotune Choices Stats: 2025-12-04T09:58:55.6957304Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2122", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008960000239312649, "best_triton_pos": 0} 2025-12-04T09:58:55.6957433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6957546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6957724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6958343Z triton_flex_attention_2122 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6958945Z triton_flex_attention_2123 0.0100 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6959589Z triton_flex_attention_2119 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6960191Z triton_flex_attention_2121 0.0133 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.6960816Z triton_flex_attention_2138 0.0134 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6961419Z triton_flex_attention_2130 0.0139 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6962026Z triton_flex_attention_2120 0.0142 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6962628Z triton_flex_attention_2136 0.0145 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6963227Z triton_flex_attention_2128 0.0149 ms 60.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6963846Z triton_flex_attention_2134 0.0166 ms 53.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6963976Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.4797 seconds precompiling for 24 choices 2025-12-04T09:58:55.6964028Z Autotune Choices Stats: 2025-12-04T09:58:55.6964783Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2157", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.6964999Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6965165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6965445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6966115Z triton_flex_attention_backward_2157 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6966759Z triton_flex_attention_backward_2151 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6967383Z triton_flex_attention_backward_2149 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6968031Z triton_flex_attention_backward_2148 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6968657Z triton_flex_attention_backward_2159 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6969292Z triton_flex_attention_backward_2158 0.0203 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6969914Z triton_flex_attention_backward_2156 0.0216 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6970552Z triton_flex_attention_backward_2161 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6971181Z 
triton_flex_attention_backward_2152 0.0228 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6971803Z triton_flex_attention_backward_2143 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6971943Z SingleProcess AUTOTUNE benchmarking takes 0.2555 seconds and 0.9394 seconds precompiling for 22 choices 2025-12-04T09:58:55.6972037Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.6972087Z Traceback (most recent call last): 2025-12-04T09:58:55.6972250Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.6972292Z self.assertTrue( 2025-12-04T09:58:55.6972396Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.6972445Z raise self.failureException(msg) 2025-12-04T09:58:55.6972583Z AssertionError: False is not true : Log file /tmp/tmp2hax7tss/flex_attention_configs.json was not created 2025-12-04T09:58:55.6972588Z 
2025-12-04T09:58:55.6972665Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.6972828Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.6972831Z 2025-12-04T09:58:55.6972923Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.6972997Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6973041Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6973079Z unimplemented [] 2025-12-04T09:58:55.6973141Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6973722Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.6973823Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6973860Z graph_break [] 2025-12-04T09:58:55.6973939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.6974432Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.6974481Z current_size = base.storage().size() 2025-12-04T09:58:55.6974523Z Autotune Choices Stats: 2025-12-04T09:58:55.6975283Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.6975413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6975527Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6975698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6976365Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6976965Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6977583Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6978186Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6978786Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6979394Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6980000Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6980621Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6981217Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6981823Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6981953Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.6981996Z Autotune Choices Stats: 2025-12-04T09:58:55.6982757Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.6982973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6983148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6983426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6984059Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6984691Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.6985321Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6985995Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6986626Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6987253Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6987888Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6988516Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6989164Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6989783Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6989928Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.6990003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.6990045Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.6990082Z unimplemented [] 2025-12-04T09:58:55.6990144Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.6990244Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.6990815Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.6990854Z graph_break [] 2025-12-04T09:58:55.6990927Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.6990968Z Autotune Choices Stats: 2025-12-04T09:58:55.6991712Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.6991841Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6991955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6992115Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.6992723Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6993352Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6993956Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6994566Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6995179Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6995778Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.6996415Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6997017Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6997638Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6998236Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.6998377Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.6998418Z Autotune Choices Stats: 2025-12-04T09:58:55.6999172Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.6999390Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.6999556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.6999831Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7000478Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7001102Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7001748Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7002375Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7003011Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7003635Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7004263Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7004896Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7005514Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7006220Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7006348Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.7006423Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7006480Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7006519Z unimplemented [] 2025-12-04T09:58:55.7006579Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7006678Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7007247Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7007285Z graph_break [] 2025-12-04T09:58:55.7007358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7007401Z Autotune Choices Stats: 2025-12-04T09:58:55.7008143Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.7008269Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7008383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7008558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7009170Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7009771Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7010394Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7011001Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7011614Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.7012218Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7012829Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7013437Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7014034Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7014654Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7014781Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.7014822Z Autotune Choices Stats: 2025-12-04T09:58:55.7015592Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.7015807Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.7016008Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7016285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7016915Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7017556Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7018175Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7018820Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7019445Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7020086Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7020707Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7021333Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7021977Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7022600Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7022744Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.7022819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7022860Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7022899Z unimplemented [] 2025-12-04T09:58:55.7022969Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7023069Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7023645Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7023692Z graph_break [] 2025-12-04T09:58:55.7023767Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7023807Z Autotune Choices Stats: 2025-12-04T09:58:55.7024552Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.7024678Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7024792Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7024954Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7025567Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7026205Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7026808Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7027432Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7028048Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7028651Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7029253Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7029866Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7030461Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7031069Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7031209Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.7031250Z Autotune Choices Stats: 2025-12-04T09:58:55.7032016Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.7032241Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7032405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7032679Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7033315Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7033941Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7034574Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7035194Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7035840Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7036500Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7037130Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7037759Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7038402Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7039027Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7039156Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.7039229Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7039272Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7039322Z unimplemented [] 2025-12-04T09:58:55.7039384Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7039483Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7040071Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7040108Z graph_break [] 2025-12-04T09:58:55.7040184Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7040236Z Autotune Choices Stats: 2025-12-04T09:58:55.7040975Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.7041102Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7041214Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7041377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7041985Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7042600Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.7043200Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7043798Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7044420Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7045032Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7045636Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7046282Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7046900Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7047503Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7047633Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.7047673Z Autotune Choices Stats: 2025-12-04T09:58:55.7048445Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.7048674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7048840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7049129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7049758Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7050387Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7051025Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7051649Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7052274Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7052925Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7053568Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7054192Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7054818Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7055453Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7055584Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.7055658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7055701Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7055737Z unimplemented [] 2025-12-04T09:58:55.7055799Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7055898Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7056501Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7056552Z graph_break [] 2025-12-04T09:58:55.7056627Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7056666Z Autotune Choices Stats: 2025-12-04T09:58:55.7057421Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.7057561Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7057675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7057839Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7058444Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7059048Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7059670Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7060274Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7060873Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7061495Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7062109Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7062706Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7063306Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7063916Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7064048Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.7064087Z Autotune Choices Stats: 2025-12-04T09:58:55.7064845Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.7065073Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7065238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7065524Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7066193Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7066830Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7067453Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7068089Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7068719Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7069342Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7069994Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7070633Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7071257Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7071882Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7072009Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.7072083Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7072127Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7072165Z unimplemented [] 2025-12-04T09:58:55.7072239Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7072338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7072920Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7072958Z graph_break [] 2025-12-04T09:58:55.7073030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7073072Z Autotune Choices Stats: 2025-12-04T09:58:55.7073826Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.7073954Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7074067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7074241Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7074849Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7075446Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7076085Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7076705Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7077310Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7077913Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7078531Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7079145Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7079748Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7080351Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7080480Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.7080520Z Autotune Choices Stats: 2025-12-04T09:58:55.7081289Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.7081508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7081674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7081963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7082604Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7083236Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7083860Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7084485Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7085118Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7085743Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7086402Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7087072Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7087714Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7088338Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7088469Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.7088544Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7088587Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7088624Z unimplemented [] 2025-12-04T09:58:55.7088685Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7088785Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7089364Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7089403Z graph_break [] 2025-12-04T09:58:55.7089477Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7089517Z Autotune Choices Stats: 2025-12-04T09:58:55.7090257Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.7090395Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7090509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7090670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7091291Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7091902Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7092509Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7093118Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7093731Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7094331Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7094946Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7095559Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7096218Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7096817Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7096948Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.7096988Z Autotune Choices Stats: 2025-12-04T09:58:55.7097762Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.7097980Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7098143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7098418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.7099046Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7099696Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7100325Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7100947Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7101581Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7102223Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7102847Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7103475Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7104129Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7104758Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7104888Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices
2025-12-04T09:58:55.7104963Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T09:58:55.7105005Z frames [('total', 2), ('ok', 2)]
2025-12-04T09:58:55.7105044Z unimplemented []
2025-12-04T09:58:55.7105106Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T09:58:55.7105205Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T09:58:55.7105777Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)]
2025-12-04T09:58:55.7105815Z graph_break []
2025-12-04T09:58:55.7105888Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T09:58:55.7105971Z Autotune Choices Stats:
2025-12-04T09:58:55.7106723Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0}
2025-12-04T09:58:55.7106852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7106966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7107127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7107763Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7108369Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7108981Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7109579Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7110183Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7110798Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7111401Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7112006Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7112619Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7113231Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7113359Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.7113402Z Autotune Choices Stats: 2025-12-04T09:58:55.7114158Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.7114377Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7114552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7114833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7115469Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7116126Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7116776Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7117424Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7118049Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7118674Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7119309Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7119935Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7120571Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7121202Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7121341Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.7121417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7121460Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7121500Z unimplemented [] 2025-12-04T09:58:55.7121560Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7121662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7122235Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7122274Z graph_break [] 2025-12-04T09:58:55.7122347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7122388Z Autotune Choices Stats: 2025-12-04T09:58:55.7123141Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.7123268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.7123383Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7123544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7124165Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7124788Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7125389Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7126050Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7126649Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7127251Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7127874Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7128472Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7129097Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7129699Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7129839Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.7129881Z Autotune Choices Stats: 2025-12-04T09:58:55.7130643Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.7130862Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7131026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7131302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7131939Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7132564Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7133202Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7133841Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7134484Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7135119Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7135740Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7136417Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7137040Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7137692Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7137821Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.7137897Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.7137954Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7137994Z unimplemented [] 2025-12-04T09:58:55.7138055Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7138157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7138734Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7138771Z graph_break [] 2025-12-04T09:58:55.7138845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7138886Z Autotune Choices Stats: 2025-12-04T09:58:55.7139629Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.7139755Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7139869Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7140048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7140657Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7141258Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7141889Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7142485Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7143102Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7143703Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7144305Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7144919Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7145523Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7146207Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7146337Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.7146380Z Autotune Choices Stats: 2025-12-04T09:58:55.7147154Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.7147370Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7147537Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7147814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7148448Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7149102Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7149726Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7151284Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7151917Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7152558Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7153178Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7153807Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7154435Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7155056Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7155198Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.7155276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7155319Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.7155358Z unimplemented [] 2025-12-04T09:58:55.7155418Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7155565Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7156174Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7156235Z graph_break [] 2025-12-04T09:58:55.7156312Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7156354Z Autotune Choices Stats: 2025-12-04T09:58:55.7157099Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.7157228Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7157344Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7157505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7158122Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7158732Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7159333Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7159996Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7160602Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7161222Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7161822Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7162427Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7163035Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7163643Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7163785Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.7163829Z Autotune Choices Stats: 2025-12-04T09:58:55.7164614Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.7164850Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7165020Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7165299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7165977Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7166603Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7167232Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7167862Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7168530Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7169156Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7169797Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7170424Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7171053Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7171685Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7171818Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.7171894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7171942Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7171992Z unimplemented [] 
2025-12-04T09:58:55.7172054Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7172154Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7172745Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7172781Z graph_break [] 2025-12-04T09:58:55.7172858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7172910Z Autotune Choices Stats: 2025-12-04T09:58:55.7173655Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.7173789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7173904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7174069Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7174684Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7175292Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7175898Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7176546Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7177188Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7177808Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7178418Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7179022Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7179631Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7180237Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7180368Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.7180407Z Autotune Choices Stats: 2025-12-04T09:58:55.7181170Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.7181425Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7181590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7181881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7182517Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7183147Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7183768Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7184394Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7185027Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7185688Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7186349Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7186986Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7187620Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7188249Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7188383Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.7188459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7188501Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7188539Z unimplemented [] 2025-12-04T09:58:55.7188602Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.7188702Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7189275Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7189335Z graph_break [] 2025-12-04T09:58:55.7189411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7189452Z Autotune Choices Stats: 2025-12-04T09:58:55.7190231Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.7190374Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7190492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7190661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7191277Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7191888Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7192497Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7193103Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7193709Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7194369Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7194981Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7195585Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7196231Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7196838Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7196968Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.7197014Z Autotune Choices Stats: 
2025-12-04T09:58:55.7197834Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.7198142Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7198343Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7198717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7199382Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7200026Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7200659Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7201370Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7202021Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7202649Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7203314Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7203956Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7204586Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7205213Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7205344Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.7205421Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7205464Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7205503Z unimplemented [] 2025-12-04T09:58:55.7205565Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7205667Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7206277Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7206314Z graph_break [] 2025-12-04T09:58:55.7206387Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7206429Z Autotune Choices Stats: 2025-12-04T09:58:55.7207169Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.7207355Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7207472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7207644Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.7208260Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7208867Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7209597Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7210205Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7210836Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7211436Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7212075Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7212691Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7213296Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7213901Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7214032Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.7214074Z Autotune Choices Stats: 2025-12-04T09:58:55.7214839Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.7215059Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7215225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7215523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7216233Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7216873Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7217501Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7218131Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7218761Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7219396Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7220018Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7220684Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7221328Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7221958Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7222089Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.7222164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7222205Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7222242Z unimplemented [] 2025-12-04T09:58:55.7222303Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7222406Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7222980Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7223020Z graph_break [] 2025-12-04T09:58:55.7223093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7223135Z Autotune Choices Stats: 2025-12-04T09:58:55.7223889Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.7224032Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7224146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7224308Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7224951Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7225564Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7226205Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7226806Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7227407Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7228017Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7228623Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7229271Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7229894Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7230504Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7230635Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.7230679Z Autotune Choices Stats: 2025-12-04T09:58:55.7231435Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.7231652Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7231819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7232096Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7232734Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7233391Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7234025Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7234651Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7235288Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7235913Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7236566Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7237197Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7237870Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7238511Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7238641Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.7238716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7238760Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7238798Z unimplemented [] 2025-12-04T09:58:55.7238860Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7238960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7239544Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7239583Z graph_break [] 2025-12-04T09:58:55.7239657Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7239696Z Autotune Choices Stats: 2025-12-04T09:58:55.7240439Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.7240569Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7240684Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7240846Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7241487Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7242102Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7242716Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7243317Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7243930Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7244533Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7245140Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7245751Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7246428Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7247042Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7247173Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.7247216Z Autotune Choices Stats: 2025-12-04T09:58:55.7247978Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.7248217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7248384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7248662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7249299Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7249924Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7250585Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7251218Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7251844Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7252476Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7253116Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7253742Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7254383Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7255028Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7255167Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.7255242Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7255285Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7255321Z unimplemented [] 2025-12-04T09:58:55.7255382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7255485Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7256105Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7256142Z graph_break [] 2025-12-04T09:58:55.7256219Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7256258Z Autotune Choices Stats: 2025-12-04T09:58:55.7257011Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.7257140Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7257255Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7257420Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7258035Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7258677Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7259282Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7259901Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7260503Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7261109Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7261720Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7262331Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7262943Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7263571Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7263715Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.7263755Z Autotune Choices Stats: 2025-12-04T09:58:55.7264516Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.7264733Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7264901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7265182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7265814Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7266477Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7267117Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7267764Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7268407Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7269036Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7269661Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7270289Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7270920Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7271566Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7271707Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.7271781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7271824Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7271874Z unimplemented [] 2025-12-04T09:58:55.7271938Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7272037Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7272612Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7272649Z graph_break [] 2025-12-04T09:58:55.7272726Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7272766Z Autotune Choices Stats: 2025-12-04T09:58:55.7273515Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.7273644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7273758Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7273921Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7274528Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7275137Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7275777Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7276414Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.7277036Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7277648Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7278253Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7278858Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7279463Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7280102Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7280244Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.7280284Z Autotune Choices Stats: 2025-12-04T09:58:55.7281039Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.7281267Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7281431Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7281713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7282350Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7282976Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7283605Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7284254Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7284892Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7285536Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7286194Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7286829Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7287459Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7288088Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7288234Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.7288309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7288356Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7288393Z unimplemented [] 2025-12-04T09:58:55.7288455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7288566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7289157Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7289211Z graph_break [] 2025-12-04T09:58:55.7289285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7289326Z Autotune Choices Stats: 2025-12-04T09:58:55.7290063Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.7290194Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7290309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7290469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7291084Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7291688Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7292293Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7292936Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7293539Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7294161Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7294766Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7295376Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7296013Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7296620Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7296771Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.7296811Z Autotune Choices Stats: 2025-12-04T09:58:55.7297599Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.7297833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7297997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7298281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7298910Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7299543Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7300172Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7300798Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7301459Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7302082Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7302715Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7303343Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7303979Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7304607Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7304740Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.7304813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7304857Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7304896Z unimplemented [] 2025-12-04T09:58:55.7304970Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7305071Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7305666Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7305705Z graph_break [] 2025-12-04T09:58:55.7305777Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7305819Z Autotune Choices Stats: 2025-12-04T09:58:55.7306609Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.7306739Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7306855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7307017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7307636Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7308245Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7308856Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7309459Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7310117Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7310722Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7311337Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7311944Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7312545Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7313154Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7313284Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.7313323Z Autotune Choices Stats: 2025-12-04T09:58:55.7314083Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.7314330Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7314497Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7314787Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7315434Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7316094Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7316720Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7317353Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7317989Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7318656Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7319285Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7319927Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7320558Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7321181Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7321311Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.7321385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7321426Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7321467Z unimplemented [] 2025-12-04T09:58:55.7321528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7321629Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7322209Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7322259Z graph_break [] 2025-12-04T09:58:55.7322334Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7322375Z Autotune Choices Stats: 2025-12-04T09:58:55.7323135Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.7323274Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7323393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7323555Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7324178Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7324788Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7325396Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7326060Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7326668Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7327317Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7327934Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7328541Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7329149Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7329753Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7329884Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.7329927Z Autotune Choices Stats: 2025-12-04T09:58:55.7330688Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.7330918Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7331085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7331382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7332017Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7332660Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7333290Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7333916Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7334548Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7335184Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7335842Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7336510Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7337156Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7337787Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7337917Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.7337992Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7338035Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7338074Z unimplemented [] 2025-12-04T09:58:55.7338135Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7338235Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7338806Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7338846Z graph_break [] 
2025-12-04T09:58:55.7338921Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7338962Z Autotune Choices Stats: 2025-12-04T09:58:55.7339702Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.7339868Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7339986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7340157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7340776Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7341388Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7341996Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7342606Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7343213Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7343822Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7344476Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7345095Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7345703Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7346348Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7346478Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.7346519Z Autotune Choices Stats: 2025-12-04T09:58:55.7347282Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.7347502Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7347668Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7347960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7348622Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7349261Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.7349888Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7350523Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7351158Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7351792Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7352418Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7353077Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7353716Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7354348Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7354479Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.7354555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7354599Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7354638Z unimplemented [] 2025-12-04T09:58:55.7354699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7354801Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7355380Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7355418Z graph_break [] 2025-12-04T09:58:55.7355493Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.7355532Z Autotune Choices Stats: 2025-12-04T09:58:55.7356318Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.7356461Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7356577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7356737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7357370Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7358003Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7358608Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7359213Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7359821Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7360432Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7361039Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7361673Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7362295Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7362902Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7363032Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.7363073Z Autotune Choices Stats: 2025-12-04T09:58:55.7363831Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.7364050Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7364217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7364499Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7365135Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7365793Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7366469Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7367098Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7367733Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7368362Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7369001Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7369632Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7370299Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7370938Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7371067Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.7371141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7371185Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7371224Z unimplemented [] 2025-12-04T09:58:55.7371286Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7371386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7371964Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7372002Z graph_break [] 2025-12-04T09:58:55.7372076Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7372117Z Autotune Choices Stats: 
2025-12-04T09:58:55.7372862Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.7372992Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7373104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7373271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7373905Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7374526Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7375142Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7375751Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7376394Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7376991Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7377598Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7378220Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7378850Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7379469Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7379599Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.7379640Z Autotune Choices Stats: 2025-12-04T09:58:55.7380409Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.7380631Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7380799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7381079Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7381712Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7382348Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7383002Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7383633Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7384264Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7384899Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7385527Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7386204Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7386852Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7387505Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7387649Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.7387724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7387769Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7387807Z unimplemented [] 2025-12-04T09:58:55.7387869Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7387969Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7388544Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7388581Z graph_break [] 2025-12-04T09:58:55.7388655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7388695Z Autotune Choices Stats: 2025-12-04T09:58:55.7389443Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.7389572Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7389684Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7389845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7390460Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7391106Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7391714Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7392336Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7392939Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7393547Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7394153Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7394760Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7395384Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7396045Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7396189Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.7396228Z Autotune Choices Stats: 2025-12-04T09:58:55.7396990Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.7397208Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.7397372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7397653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7398294Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7398926Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7399566Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7400216Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7400851Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7401480Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7402110Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7402748Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7403383Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7404042Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7404172Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.7404246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7404300Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7404339Z unimplemented [] 2025-12-04T09:58:55.7404399Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7404499Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7405081Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7405118Z graph_break [] 2025-12-04T09:58:55.7405191Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7405234Z Autotune Choices Stats: 2025-12-04T09:58:55.7406011Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.7406141Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7406255Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7406417Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7407035Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7407644Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7408302Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7408908Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7409531Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7410134Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7410743Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7411355Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7411962Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7412597Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7412727Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.7412767Z Autotune Choices Stats: 2025-12-04T09:58:55.7413533Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.7413756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7413920Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7414202Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7414837Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7418978Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7419616Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7420314Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7420946Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7421597Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7422224Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7422859Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7423489Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7424123Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7424272Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.7424354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7424399Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7424437Z unimplemented [] 2025-12-04T09:58:55.7424503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7424627Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7425201Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7425259Z graph_break [] 2025-12-04T09:58:55.7425333Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7425375Z Autotune Choices Stats: 2025-12-04T09:58:55.7426147Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.7426278Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7426394Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7426558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7427169Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7427774Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7428379Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7429030Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7429630Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7430251Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7430863Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7431468Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7432072Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7432673Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7432819Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.7432860Z Autotune Choices Stats: 2025-12-04T09:58:55.7433674Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.7433913Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7434078Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7434357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7434990Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7435609Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7436262Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7436887Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7437559Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7438182Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7438822Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7439452Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7440078Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7440701Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7440831Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.7440906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7440951Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7441002Z unimplemented [] 2025-12-04T09:58:55.7441064Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7441166Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7441768Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7441807Z graph_break [] 2025-12-04T09:58:55.7441881Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7441936Z Autotune Choices Stats: 2025-12-04T09:58:55.7442679Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.7442807Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7442922Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7443087Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7443694Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7444297Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.7444899Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7445498Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7446187Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7446805Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7447407Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7448008Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7448609Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7449211Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7449340Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.7449383Z Autotune Choices Stats: 2025-12-04T09:58:55.7450150Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.7450389Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7450556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7450842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7451474Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7452099Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7452725Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7453350Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7453979Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7454637Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7455273Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7455891Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7456552Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7457178Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7457309Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.7457384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7457427Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7457464Z unimplemented [] 2025-12-04T09:58:55.7457526Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7457627Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7458194Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7458254Z graph_break [] 2025-12-04T09:58:55.7458327Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7458367Z Autotune Choices Stats: 2025-12-04T09:58:55.7459132Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.7459274Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7459389Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7459552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7460162Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7460769Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7461366Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7461971Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7462576Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7463207Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7463819Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7464423Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7465025Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7465629Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7465758Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.7465798Z Autotune Choices Stats: 2025-12-04T09:58:55.7466600Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.7466840Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7467019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7467307Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7467950Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7468579Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7469203Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7469824Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7470459Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7471088Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7471743Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7472378Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7473010Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7473635Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7473766Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.7473840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7473885Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7473923Z unimplemented [] 2025-12-04T09:58:55.7473985Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7474085Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7474660Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7474696Z graph_break [] 2025-12-04T09:58:55.7474772Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7474831Z Autotune Choices Stats: 2025-12-04T09:58:55.7475595Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.7475721Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7475835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7476039Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7476648Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7477253Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7477857Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7478455Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7479056Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7479678Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7480306Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7480921Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7481521Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7482129Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7482257Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.7482301Z Autotune Choices Stats: 2025-12-04T09:58:55.7483062Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.7483278Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7483444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7483736Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7484389Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7485024Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7485648Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7486336Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7486960Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7487588Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7488212Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7488890Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7489531Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7490161Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7490290Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.7490363Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7490407Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7490443Z unimplemented [] 2025-12-04T09:58:55.7490506Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7490605Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7491179Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7491217Z graph_break [] 2025-12-04T09:58:55.7491292Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7491331Z Autotune Choices Stats: 2025-12-04T09:58:55.7492072Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.7492212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7492325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7492511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7493130Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7493741Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7494345Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7494948Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7495552Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7496207Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7496834Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7497470Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7498085Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7498687Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7498820Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.7498859Z Autotune Choices Stats: 2025-12-04T09:58:55.7499623Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.7499841Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7500007Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7500283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7500912Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7501575Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7502215Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7502839Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7503467Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7504090Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7504715Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7505355Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7506050Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7506694Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7506824Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.7506899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7506942Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7506980Z unimplemented [] 2025-12-04T09:58:55.7507043Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7507142Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7507715Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7507751Z graph_break [] 2025-12-04T09:58:55.7507827Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7507868Z Autotune Choices Stats: 2025-12-04T09:58:55.7508611Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.7508740Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7508854Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7509041Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7509676Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7510281Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7510903Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7511507Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7512110Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7512716Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7513329Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7513968Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7514571Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7515196Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7515327Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.7515370Z Autotune Choices Stats: 2025-12-04T09:58:55.7516176Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.7516396Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7516563Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7516847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.7517494Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7518161Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7518802Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7519445Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7520074Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7520705Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7521328Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7521960Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7522629Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7523256Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7523402Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.7523479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7523526Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7523566Z unimplemented [] 2025-12-04T09:58:55.7523630Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7523730Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7524304Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7524347Z graph_break [] 2025-12-04T09:58:55.7524420Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7524463Z Autotune Choices Stats: 2025-12-04T09:58:55.7525197Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.7525327Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7525441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7525605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7526276Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7526920Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7527526Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7528144Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7528754Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7529357Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7529968Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7530571Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7531218Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7531820Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7531964Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.7532007Z Autotune Choices Stats: 2025-12-04T09:58:55.7532768Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.7532992Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7533161Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7533441Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7534073Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7534694Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7535378Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7536040Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7536687Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7537313Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7537948Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7538574Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7539203Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7539885Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7540021Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.7540111Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7540155Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7540199Z unimplemented [] 2025-12-04T09:58:55.7540260Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7540366Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7540942Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7540984Z graph_break [] 2025-12-04T09:58:55.7541060Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7541106Z Autotune Choices Stats: 2025-12-04T09:58:55.7541844Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.7541975Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7542094Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7542258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7542876Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7543481Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7544119Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7544736Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7545346Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7545998Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7546612Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7547221Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7547828Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7548484Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7548622Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.7548685Z Autotune Choices Stats: 2025-12-04T09:58:55.7549449Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.7549673Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7549846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7550125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7550762Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7551394Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7552018Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7552691Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7553325Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7553981Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7554604Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7555237Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7555877Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7556540Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7556701Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.7556785Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7556830Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7556897Z unimplemented [] 2025-12-04T09:58:55.7556978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7557086Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7557664Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7557727Z graph_break [] 2025-12-04T09:58:55.7557806Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7557856Z Autotune Choices Stats: 2025-12-04T09:58:55.7558603Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.7558732Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7558853Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7559017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7559632Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7560248Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7560851Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7561499Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7562126Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7562735Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7563345Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7563954Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7564572Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7565178Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7565323Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.7565372Z Autotune Choices Stats: 2025-12-04T09:58:55.7566247Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.7566487Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7566661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7566945Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7567590Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7568220Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7568847Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7569484Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7570202Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7570846Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7571479Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7572115Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7572746Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7573377Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7573511Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.7573609Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.7573655Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7573702Z unimplemented [] 2025-12-04T09:58:55.7573765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7573871Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7574478Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7574532Z graph_break [] 2025-12-04T09:58:55.7574612Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7574657Z Autotune Choices Stats: 2025-12-04T09:58:55.7575397Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.7575528Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7575652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7575816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7576508Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7577113Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7577734Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7578350Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7579010Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7579636Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7580246Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7580855Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7581461Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7582068Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7582200Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.7582262Z Autotune Choices Stats: 2025-12-04T09:58:55.7583059Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.7583278Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7583461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7583737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7584374Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7585009Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7585641Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7586313Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7586942Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7587625Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7588264Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7588893Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7589527Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7590157Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7590289Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.7590371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7590420Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.7590460Z unimplemented [] 2025-12-04T09:58:55.7590523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7590628Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7591203Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7591257Z graph_break [] 2025-12-04T09:58:55.7591338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7591380Z Autotune Choices Stats: 2025-12-04T09:58:55.7592149Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.7592293Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7592409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7592580Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7593187Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7593794Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7594398Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7595001Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7595624Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7596318Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7596940Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7597543Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7598154Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7598764Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7598900Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.7598941Z Autotune Choices Stats: 2025-12-04T09:58:55.7599705Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.7599945Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7600139Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7600421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7601071Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7601699Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7602329Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7602959Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7603590Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7604222Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7604886Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7605536Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7606200Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7606830Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7606965Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.7607043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7607091Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7607132Z unimplemented [] 
2025-12-04T09:58:55.7607196Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7607297Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7607886Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7607926Z graph_break [] 2025-12-04T09:58:55.7608030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7608073Z Autotune Choices Stats: 2025-12-04T09:58:55.7608842Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.7608976Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7609107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7609278Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7609897Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7610507Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7611114Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7611724Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7612340Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7612971Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7613590Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7614213Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7614819Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7615428Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7615563Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.7615607Z Autotune Choices Stats: 2025-12-04T09:58:55.7616420Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.7616645Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7616825Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7617107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7617769Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7618420Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7619052Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7619681Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7620307Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7620941Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7621590Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7622232Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7622874Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7623506Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7623643Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.7623718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7623767Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7623806Z unimplemented [] 2025-12-04T09:58:55.7623875Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.7623978Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7624556Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7624602Z graph_break [] 2025-12-04T09:58:55.7624679Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7624726Z Autotune Choices Stats: 2025-12-04T09:58:55.7625464Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.7625606Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7625735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7625914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7626565Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7627180Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7627795Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7628404Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7629010Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7629613Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7630262Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7630866Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7631480Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7632088Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7632224Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.7632268Z Autotune Choices Stats: 
2025-12-04T09:58:55.7633033Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.7633254Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7633420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7633702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7634367Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7635011Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7635651Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7636319Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7636962Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7637592Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7638220Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7638894Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7639525Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7640169Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7640305Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.7640383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7640432Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7640471Z unimplemented [] 2025-12-04T09:58:55.7640538Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7640641Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7641226Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7641273Z graph_break [] 2025-12-04T09:58:55.7641351Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7641399Z Autotune Choices Stats: 2025-12-04T09:58:55.7642140Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.7642273Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7642405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7642567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.7643215Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7643833Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7644443Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7645049Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7645664Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7646439Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7647054Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7647705Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7648315Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7648931Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7649066Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.7649116Z Autotune Choices Stats: 2025-12-04T09:58:55.7649876Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.7650098Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7650266Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7650550Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7651184Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7651847Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7652482Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7653126Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7653759Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7654385Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7655022Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7655652Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7656369Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7656996Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7657146Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.7657230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7657276Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7657320Z unimplemented [] 2025-12-04T09:58:55.7657383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7657489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7658069Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7658113Z graph_break [] 2025-12-04T09:58:55.7658190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7658238Z Autotune Choices Stats: 2025-12-04T09:58:55.7658974Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.7659108Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7659231Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7659395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7660010Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7660651Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7661266Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7661876Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7662486Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7663090Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7663699Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.7664309Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7664945Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7665563Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7665694Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.7665743Z Autotune Choices Stats: 2025-12-04T09:58:55.7666545Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.7666768Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7666939Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7667215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7667855Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7668488Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7669166Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7669794Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7670439Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7671075Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7671702Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7672340Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7672978Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7673636Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7673776Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.7673858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7673902Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7673944Z unimplemented [] 2025-12-04T09:58:55.7674008Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7674111Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.7674684Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7674737Z graph_break [] 2025-12-04T09:58:55.7674817Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7674860Z Autotune Choices Stats: 2025-12-04T09:58:55.7675613Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.7675742Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7675865Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7676062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7676675Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7677279Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7677931Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7678547Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7679153Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7679759Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7680362Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7680975Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7681583Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7682229Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7682389Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.7682435Z Autotune Choices Stats: 2025-12-04T09:58:55.7683202Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.7683420Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7683596Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7683878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7684518Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.7685153Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7685781Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7686507Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7687153Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7687787Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7688419Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7689050Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7689678Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7690307Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7690454Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.7690555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7690600Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7690644Z unimplemented [] 2025-12-04T09:58:55.7690708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7690818Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7691402Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7691442Z graph_break [] 2025-12-04T09:58:55.7691524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7691567Z Autotune Choices Stats: 2025-12-04T09:58:55.7692303Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.7692434Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7692554Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7692723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7693342Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7693956Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7694571Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7695225Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7695845Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7696483Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7697094Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7697700Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7698311Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7698916Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7699063Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.7699122Z Autotune Choices Stats: 2025-12-04T09:58:55.7699902Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.7700136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7700307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7700591Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7701225Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7701853Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7702486Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7703110Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7703772Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7704423Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7705050Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7705685Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7706354Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7706987Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7707138Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.7707215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7707263Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7707304Z unimplemented [] 2025-12-04T09:58:55.7707372Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7707475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7708092Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7708146Z graph_break [] 2025-12-04T09:58:55.7708227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7708270Z Autotune Choices Stats: 2025-12-04T09:58:55.7709019Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2030", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.7709150Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7709266Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7709429Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7710034Z triton_flex_attention_2030 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7710636Z triton_flex_attention_2031 0.0108 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7711245Z triton_flex_attention_2026 0.0112 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7711870Z triton_flex_attention_2028 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7712484Z triton_flex_attention_2029 0.0116 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7713106Z triton_flex_attention_2046 0.0132 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7713710Z triton_flex_attention_2027 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7714317Z triton_flex_attention_2038 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7714917Z triton_flex_attention_2044 0.0144 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7715526Z triton_flex_attention_2024 0.0147 ms 63.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7715675Z SingleProcess AUTOTUNE benchmarking takes 0.1936 seconds and 0.4021 seconds precompiling for 24 choices 2025-12-04T09:58:55.7715718Z Autotune Choices Stats: 2025-12-04T09:58:55.7716637Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2065", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.7716871Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7717039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7717317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7717954Z triton_flex_attention_backward_2065 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7718587Z triton_flex_attention_backward_2059 0.0182 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7719214Z triton_flex_attention_backward_2056 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7719851Z triton_flex_attention_backward_2057 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7720495Z triton_flex_attention_backward_2066 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7721149Z triton_flex_attention_backward_2067 0.0200 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7721786Z triton_flex_attention_backward_2064 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7722418Z triton_flex_attention_backward_2069 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7723047Z triton_flex_attention_backward_2060 0.0224 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7723677Z triton_flex_attention_backward_2051 0.0230 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7723816Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.8209 seconds precompiling for 22 choices 2025-12-04T09:58:55.7723894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7723945Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7723985Z unimplemented [] 2025-12-04T09:58:55.7724051Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7724153Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7724748Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7724792Z graph_break [] 2025-12-04T09:58:55.7724891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7724934Z Autotune Choices Stats: 2025-12-04T09:58:55.7725678Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2077", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010320000350475311, "best_triton_pos": 0} 2025-12-04T09:58:55.7725825Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7725975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7726142Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7726764Z triton_flex_attention_2077 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7727373Z triton_flex_attention_2074 0.0118 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7727983Z triton_flex_attention_2076 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7728588Z triton_flex_attention_2073 0.0130 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7729237Z triton_flex_attention_2084 0.0136 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7729844Z triton_flex_attention_2092 0.0139 ms 74.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7730468Z triton_flex_attention_2090 0.0144 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7731069Z triton_flex_attention_2082 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7731678Z triton_flex_attention_2075 0.0154 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7732429Z triton_flex_attention_2088 0.0165 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7732563Z SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 0.3908 seconds precompiling for 24 choices 2025-12-04T09:58:55.7732607Z Autotune Choices Stats: 2025-12-04T09:58:55.7733360Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2111", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.7733597Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7733787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7734078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7734714Z triton_flex_attention_backward_2111 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7735337Z triton_flex_attention_backward_2105 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7736003Z triton_flex_attention_backward_2110 0.0181 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7736628Z triton_flex_attention_backward_2102 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7737249Z triton_flex_attention_backward_2103 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7737903Z triton_flex_attention_backward_2113 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7738548Z triton_flex_attention_backward_2112 0.0204 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7739191Z triton_flex_attention_backward_2115 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7739814Z triton_flex_attention_backward_2097 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7740445Z triton_flex_attention_backward_2106 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7740578Z SingleProcess AUTOTUNE benchmarking takes 0.4709 seconds and 0.7187 seconds precompiling for 22 choices 2025-12-04T09:58:55.7740654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7740700Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7740738Z unimplemented [] 2025-12-04T09:58:55.7740802Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7740903Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7741482Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7741538Z graph_break [] 2025-12-04T09:58:55.7741614Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7741662Z Autotune Choices Stats: 2025-12-04T09:58:55.7742423Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2122", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008960000239312649, "best_triton_pos": 0} 2025-12-04T09:58:55.7742575Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7742695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7742858Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7743475Z triton_flex_attention_2122 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7744076Z triton_flex_attention_2123 0.0100 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7744685Z triton_flex_attention_2119 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7745299Z triton_flex_attention_2121 0.0133 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.7745911Z triton_flex_attention_2138 0.0134 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7746592Z triton_flex_attention_2130 0.0139 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7747201Z triton_flex_attention_2120 0.0142 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7747822Z triton_flex_attention_2136 0.0145 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7748425Z triton_flex_attention_2128 0.0149 ms 60.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7749029Z triton_flex_attention_2134 0.0166 ms 53.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7749165Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.4797 seconds precompiling for 24 choices 2025-12-04T09:58:55.7749209Z Autotune Choices Stats: 2025-12-04T09:58:55.7749978Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2157", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.7750202Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7750382Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7750680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7751331Z triton_flex_attention_backward_2157 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7751976Z triton_flex_attention_backward_2151 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7752604Z triton_flex_attention_backward_2149 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7753232Z triton_flex_attention_backward_2148 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7753871Z triton_flex_attention_backward_2159 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7754499Z triton_flex_attention_backward_2158 0.0203 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7755159Z triton_flex_attention_backward_2156 0.0216 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7755790Z triton_flex_attention_backward_2161 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7756465Z 
triton_flex_attention_backward_2152 0.0228 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7757088Z triton_flex_attention_backward_2143 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7757221Z SingleProcess AUTOTUNE benchmarking takes 0.2555 seconds and 0.9394 seconds precompiling for 22 choices 2025-12-04T09:58:55.7757302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7757347Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7757390Z unimplemented [] 2025-12-04T09:58:55.7757453Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7757558Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7758133Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7758175Z graph_break [] 2025-12-04T09:58:55.7758253Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7758298Z Autotune Choices Stats: 2025-12-04T09:58:55.7759040Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2168", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009200000204145908, "best_triton_pos": 0} 2025-12-04T09:58:55.7759189Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7759333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7759496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7760117Z triton_flex_attention_2168 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7760724Z triton_flex_attention_2166 0.0101 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7761328Z triton_flex_attention_2169 0.0104 ms 88.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7761934Z triton_flex_attention_2167 0.0113 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7762546Z triton_flex_attention_2184 0.0132 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7763154Z triton_flex_attention_2165 0.0133 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7763793Z triton_flex_attention_2176 0.0135 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7764398Z triton_flex_attention_2182 0.0140 ms 65.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7765021Z triton_flex_attention_2174 0.0150 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7765628Z triton_flex_attention_2180 0.0164 ms 56.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7765760Z SingleProcess AUTOTUNE benchmarking takes 0.2350 seconds and 0.4301 seconds precompiling for 24 choices 2025-12-04T09:58:55.7765808Z Autotune Choices Stats: 2025-12-04T09:58:55.7766630Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2203", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.7766851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7767020Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7767304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7767977Z triton_flex_attention_backward_2203 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7768608Z triton_flex_attention_backward_2197 0.0181 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7769249Z triton_flex_attention_backward_2195 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7769879Z triton_flex_attention_backward_2194 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7770510Z triton_flex_attention_backward_2205 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7771140Z triton_flex_attention_backward_2204 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7771766Z triton_flex_attention_backward_2202 0.0217 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7772426Z triton_flex_attention_backward_2207 0.0219 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7773057Z 
triton_flex_attention_backward_2198 0.0227 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7773696Z triton_flex_attention_backward_2189 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7773828Z SingleProcess AUTOTUNE benchmarking takes 0.2634 seconds and 0.7312 seconds precompiling for 22 choices 2025-12-04T09:58:55.7773928Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.7773978Z Traceback (most recent call last): 2025-12-04T09:58:55.7774139Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.7774182Z self.assertTrue( 2025-12-04T09:58:55.7774300Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.7774354Z raise self.failureException(msg) 2025-12-04T09:58:55.7774486Z AssertionError: False is not true : Log file /tmp/tmp2um3xr6u/flex_attention_configs.json was not created 2025-12-04T09:58:55.7774490Z 
2025-12-04T09:58:55.7774569Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.7774740Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.7774743Z 2025-12-04T09:58:55.7774837Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.7774919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7774963Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7775007Z unimplemented [] 2025-12-04T09:58:55.7775070Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7775648Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.7775765Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7775806Z graph_break [] 2025-12-04T09:58:55.7775886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7776439Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.7776492Z current_size = base.storage().size() 2025-12-04T09:58:55.7776535Z Autotune Choices Stats: 2025-12-04T09:58:55.7777286Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.7777434Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7777551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7777717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7778328Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7778944Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7779554Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7780156Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7780806Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7781416Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7782033Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7782634Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7783240Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7783850Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7783986Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.7784030Z Autotune Choices Stats: 2025-12-04T09:58:55.7784793Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.7785047Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7785220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7785517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7786179Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7786806Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7787431Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7788058Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7788684Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7789360Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7789987Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7790631Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7791254Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7791880Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7792014Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.7792092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7792142Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7792183Z unimplemented [] 2025-12-04T09:58:55.7792252Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7792353Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7792933Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7792984Z graph_break [] 2025-12-04T09:58:55.7793064Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.7793106Z Autotune Choices Stats: 2025-12-04T09:58:55.7793865Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.7794009Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7794128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7794297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7794910Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7795514Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7796160Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7796770Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7797371Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7798019Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7798624Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7799247Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7799844Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7800450Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7800592Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.7800636Z Autotune Choices Stats: 2025-12-04T09:58:55.7801409Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.7801631Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7801807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7802116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7802744Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7803384Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7804004Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7804629Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7805250Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7805881Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7806592Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7807220Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7807860Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7808484Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7808620Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.7808699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7808749Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7808790Z unimplemented [] 2025-12-04T09:58:55.7808858Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7808960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7809538Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7809581Z graph_break [] 2025-12-04T09:58:55.7809659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7809705Z Autotune Choices Stats: 2025-12-04T09:58:55.7810440Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.7810588Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7810723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7810890Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7811518Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7812118Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7812729Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7813336Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7813950Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.7814555Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7815194Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7815798Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7816455Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7817062Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7817198Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.7817240Z Autotune Choices Stats: 2025-12-04T09:58:55.7818003Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.7818227Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.7818395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7818676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7819370Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7819994Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7820635Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7821265Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7821900Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7822522Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7823149Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7823817Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7824444Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7825074Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7825210Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.7825292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7825336Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7825380Z unimplemented [] 2025-12-04T09:58:55.7825445Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7825549Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7826164Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7826211Z graph_break [] 2025-12-04T09:58:55.7826287Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7826336Z Autotune Choices Stats: 2025-12-04T09:58:55.7827082Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.7827215Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7827352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7827512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7828163Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7828783Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7829389Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7829995Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7830612Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7831220Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7831826Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7832473Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7833084Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7833689Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7833825Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.7833869Z Autotune Choices Stats: 2025-12-04T09:58:55.7834629Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.7834853Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7835024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7835303Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7835972Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7836636Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7837262Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7837905Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7838534Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7839169Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7839792Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7840421Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7841082Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7841723Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7841854Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.7841938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7841982Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7842028Z unimplemented [] 2025-12-04T09:58:55.7842093Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7842200Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7842770Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7842814Z graph_break [] 2025-12-04T09:58:55.7842894Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7842942Z Autotune Choices Stats: 2025-12-04T09:58:55.7843694Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.7843825Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7843946Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7844109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7844722Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7845368Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.7846026Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7846631Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7847237Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7847841Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7848448Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7849053Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7849700Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7850319Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7850452Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.7850499Z Autotune Choices Stats: 2025-12-04T09:58:55.7851258Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.7851482Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7851652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7851932Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7852576Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7853204Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7853857Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7854497Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7855130Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7855763Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7856429Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7857061Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7857691Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7858363Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7858508Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.7858590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7858632Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7858675Z unimplemented [] 2025-12-04T09:58:55.7858739Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7858844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7859424Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7859469Z graph_break [] 2025-12-04T09:58:55.7859549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7859592Z Autotune Choices Stats: 2025-12-04T09:58:55.7860334Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.7860462Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7860582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7860746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7861363Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7861994Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7862618Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7863237Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7863843Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7864455Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7865054Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7865656Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7866307Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7866948Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7867091Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.7867141Z Autotune Choices Stats: 2025-12-04T09:58:55.7867899Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.7868117Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7868294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7868574Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7869211Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7869842Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7870471Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7871126Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7871766Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7872397Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7873027Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7873652Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7874282Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7874912Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7875056Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.7875162Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7875209Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7875252Z unimplemented [] 2025-12-04T09:58:55.7875317Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7875442Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7876055Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7876095Z graph_break [] 2025-12-04T09:58:55.7876177Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7876218Z Autotune Choices Stats: 2025-12-04T09:58:55.7876959Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.7877087Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7877209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7877373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7877983Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7878594Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7879224Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7879858Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7880476Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7881082Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7881691Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7882295Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7882903Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7883507Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7883668Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.7883723Z Autotune Choices Stats: 2025-12-04T09:58:55.7884483Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.7884717Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7884889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7885172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7885804Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7886474Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7887108Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7887731Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7888419Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7889066Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7889693Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7890318Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7890950Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7891578Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7891724Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.7891800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7891848Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7891887Z unimplemented [] 2025-12-04T09:58:55.7891952Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7892053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7892660Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7892709Z graph_break [] 2025-12-04T09:58:55.7892789Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7892832Z Autotune Choices Stats: 2025-12-04T09:58:55.7893573Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.7893706Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7893823Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7893989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7894596Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7895203Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7895812Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7896491Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7897106Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7897730Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7898335Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7898940Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7899543Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7900155Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7900310Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.7900356Z Autotune Choices Stats: 2025-12-04T09:58:55.7901147Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.7901380Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7901547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7901829Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.7902467Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7903096Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7903727Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7904364Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7905000Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7905651Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7906325Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7906956Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7907581Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7908208Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7908345Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.7908424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7908474Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7908516Z unimplemented [] 2025-12-04T09:58:55.7908584Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7908687Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7909282Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7909321Z graph_break [] 2025-12-04T09:58:55.7909433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7909476Z Autotune Choices Stats: 2025-12-04T09:58:55.7910217Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.7910365Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7910482Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7910648Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7911267Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7911871Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7912479Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7913088Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7913730Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7914335Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7914955Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7915560Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7916206Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7916810Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7916949Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.7916994Z Autotune Choices Stats: 2025-12-04T09:58:55.7917766Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.7918004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7918201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7918497Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7919135Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7919762Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7920393Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7921026Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7921658Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7922312Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7922953Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7923590Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7924220Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7924847Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7924983Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.7925059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7925106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7925147Z unimplemented [] 2025-12-04T09:58:55.7925213Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7925315Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7925894Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7925988Z graph_break [] 2025-12-04T09:58:55.7926064Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7926111Z Autotune Choices Stats: 2025-12-04T09:58:55.7926883Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.7927031Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.7927147Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7931044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7931668Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7932265Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7932872Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7933473Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7934070Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7934719Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7935323Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7935987Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7936587Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7937192Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7937326Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.7937370Z Autotune Choices Stats: 2025-12-04T09:58:55.7938131Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.7938350Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7938530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7938820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7939460Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7940097Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7940716Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7941339Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7941967Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7942592Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7943244Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.7943875Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7944517Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7945140Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7945270Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.7945349Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.7945393Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7945432Z unimplemented [] 2025-12-04T09:58:55.7945493Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7945595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7946209Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7946247Z graph_break [] 2025-12-04T09:58:55.7946322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7946364Z Autotune Choices Stats: 2025-12-04T09:58:55.7947103Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.7947253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7947394Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7947556Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7948176Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7948774Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7949367Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.7950131Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7950735Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7951331Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7951970Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7952569Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7953186Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7953783Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7953913Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.7953955Z Autotune Choices Stats: 2025-12-04T09:58:55.7954707Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.7954925Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7955093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7955369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7956078Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7956701Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7957338Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7957962Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7958590Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7959215Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7959838Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7960509Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7961132Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7961774Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7961903Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.7961978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7962021Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.7962061Z unimplemented [] 2025-12-04T09:58:55.7962123Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7962223Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7962793Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7962833Z graph_break [] 2025-12-04T09:58:55.7962906Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7962949Z Autotune Choices Stats: 2025-12-04T09:58:55.7963691Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.7963819Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7963947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7964108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7964741Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7965343Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7965978Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7966581Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7967178Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7967784Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7968384Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7969029Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7969628Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7970245Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7970374Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.7970415Z Autotune Choices Stats: 2025-12-04T09:58:55.7971168Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.7971385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7971552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7971828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7972457Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7973120Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7973738Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7974377Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.7974997Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7975621Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7976271Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7976895Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7977581Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7978202Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7978342Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.7978420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7978461Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7978499Z unimplemented [] 
2025-12-04T09:58:55.7978560Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.7978661Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7979233Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.7979272Z graph_break [] 2025-12-04T09:58:55.7979347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7979387Z Autotune Choices Stats: 2025-12-04T09:58:55.7980120Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.7980247Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7980362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7980522Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7981131Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7981767Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7982377Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7982977Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7983580Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7984188Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7984796Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7985393Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7986070Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7986687Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7986816Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.7986858Z Autotune Choices Stats: 2025-12-04T09:58:55.7987616Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.7987833Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7987999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7988281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7988915Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7989531Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7990187Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7990813Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7991447Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7992073Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7992701Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7993329Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7993954Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7994615Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7994755Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.7994832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.7994875Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.7994913Z unimplemented [] 2025-12-04T09:58:55.7994973Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.7995073Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.7995643Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.7995681Z graph_break [] 2025-12-04T09:58:55.7995755Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.7995794Z Autotune Choices Stats: 2025-12-04T09:58:55.7996577Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.7996704Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.7996820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.7996981Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.7997596Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7998200Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.7998839Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.7999451Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8000055Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8000658Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8001256Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8001862Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8002460Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8003093Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8003236Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.8003276Z Autotune Choices Stats: 
2025-12-04T09:58:55.8004044Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.8004258Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8004423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8004701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8005330Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8006000Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8006623Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8007285Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8007925Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8008553Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8009181Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8009804Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8010435Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8011056Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8011197Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.8011290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8011334Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8011372Z unimplemented [] 2025-12-04T09:58:55.8011433Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8011531Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8012115Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8012151Z graph_break [] 2025-12-04T09:58:55.8012226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8012267Z Autotune Choices Stats: 2025-12-04T09:58:55.8013008Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.8013138Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8013252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8013414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.8014019Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8014622Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8015238Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8015859Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8016511Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8017119Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8017731Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8018336Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8018939Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8019546Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8019693Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.8019747Z Autotune Choices Stats: 2025-12-04T09:58:55.8020516Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8020752Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8020916Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8021194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8021833Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8022455Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8023078Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8023700Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8024363Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8024997Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8025619Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8026293Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8026910Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8027533Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8027663Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.8027754Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8027798Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8027835Z unimplemented [] 2025-12-04T09:58:55.8027897Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8027994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8028593Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8028643Z graph_break [] 2025-12-04T09:58:55.8028717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8028757Z Autotune Choices Stats: 2025-12-04T09:58:55.8029497Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.8029624Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8029738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8029897Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8030514Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8031116Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8031719Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8032329Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8032953Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8033568Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8034166Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8034768Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8035371Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8036016Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8036160Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.8036200Z Autotune Choices Stats: 2025-12-04T09:58:55.8036984Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.8037202Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8037380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8037655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8038278Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8038902Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8039534Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8040156Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8040782Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8041445Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8042081Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8042704Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8043329Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8043953Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8044083Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.8044159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8044203Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8044240Z unimplemented [] 2025-12-04T09:58:55.8044302Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8044400Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8044988Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8045026Z graph_break [] 2025-12-04T09:58:55.8045111Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8045162Z Autotune Choices Stats: 2025-12-04T09:58:55.8045898Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.8046076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8046191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8046352Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8046964Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8047562Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8048164Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8048769Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8049401Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8050015Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8050628Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8051227Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8051830Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8052429Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8052559Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.8052599Z Autotune Choices Stats: 2025-12-04T09:58:55.8053358Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.8053587Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8053771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8054048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8054684Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8055303Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8055960Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8056582Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8057213Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8057862Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8058515Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8059149Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8059774Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8060395Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8060525Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.8060604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8060647Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8060686Z unimplemented [] 2025-12-04T09:58:55.8060748Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8060852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8061420Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8061472Z graph_break [] 2025-12-04T09:58:55.8061548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8061590Z Autotune Choices Stats: 2025-12-04T09:58:55.8062347Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.8062484Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8062600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8062760Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8063375Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8063982Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8064583Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8065191Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8065793Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8066476Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8067074Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8067692Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8068296Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8068900Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8069037Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.8069080Z Autotune Choices Stats: 2025-12-04T09:58:55.8069841Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8070058Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8070237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8070515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8071161Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8071801Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8072421Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8073048Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8073673Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8074300Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8074945Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8075585Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8076260Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8076886Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8077015Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.8077095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8077139Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8077179Z unimplemented [] 2025-12-04T09:58:55.8077241Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8077344Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8077918Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8077959Z graph_break [] 2025-12-04T09:58:55.8078036Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8078080Z Autotune Choices Stats: 2025-12-04T09:58:55.8078821Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.8078963Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8079111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8079272Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8079889Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8080508Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8081108Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8081715Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.8082321Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8082922Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8083551Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8084159Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8084773Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8085376Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8085506Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.8085548Z Autotune Choices Stats: 2025-12-04T09:58:55.8086350Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8086570Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8086736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8087012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8087671Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8088319Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8088956Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8089579Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8090212Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8090843Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8091465Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8092127Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8092757Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8093393Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8093522Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.8093600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8093644Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8093684Z unimplemented [] 2025-12-04T09:58:55.8093745Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8093846Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8094417Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8094456Z graph_break [] 2025-12-04T09:58:55.8094532Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8094572Z Autotune Choices Stats: 2025-12-04T09:58:55.8095315Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.8095440Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8095566Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8095725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8096400Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8097012Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8097618Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8098217Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8098816Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8099425Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8100028Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8100668Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8101268Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8101887Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8102015Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.8102058Z Autotune Choices Stats: 2025-12-04T09:58:55.8102815Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.8103031Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8103200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8103473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8104107Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8104772Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8105393Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8106067Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8106697Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8107323Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8107945Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8108574Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8109241Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8109861Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8110000Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.8110081Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8110126Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8110164Z unimplemented [] 2025-12-04T09:58:55.8110225Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8110326Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8110909Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8110949Z graph_break [] 2025-12-04T09:58:55.8111024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8111063Z Autotune Choices Stats: 2025-12-04T09:58:55.8111797Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.8111925Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8112039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8112285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8112896Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8113524Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8114138Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8114736Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8115344Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8115979Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8116582Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8117183Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8117825Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8118439Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8118569Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.8118609Z Autotune Choices Stats: 2025-12-04T09:58:55.8119364Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.8119581Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8119747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8120023Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8120650Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8121274Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8121940Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8122561Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8123197Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8123824Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8124446Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8125068Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8125693Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8126389Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8126532Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.8126608Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8126652Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8126688Z unimplemented [] 2025-12-04T09:58:55.8126749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8126849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8127425Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8127464Z graph_break [] 2025-12-04T09:58:55.8127538Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8127578Z Autotune Choices Stats: 2025-12-04T09:58:55.8128324Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.8128453Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8128569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8128732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8129355Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8129960Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8130600Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8131213Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8131810Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8132416Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8133020Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8133624Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8134222Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8134853Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8135002Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.8135044Z Autotune Choices Stats: 2025-12-04T09:58:55.8135807Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.8136060Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8136226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8136505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8137147Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8137781Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8138402Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8139063Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8139788Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8140452Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8141092Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8141729Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8142362Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8142988Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8143143Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.8143236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8143291Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8143333Z unimplemented [] 2025-12-04T09:58:55.8143396Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8143496Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8144086Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8144125Z graph_break [] 
2025-12-04T09:58:55.8144199Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8144244Z Autotune Choices Stats: 2025-12-04T09:58:55.8144988Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.8145115Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8145230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8145393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8146040Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8146650Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8147272Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8147905Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8148524Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8149131Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8149758Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8150363Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8150974Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8151577Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8151718Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.8151770Z Autotune Choices Stats: 2025-12-04T09:58:55.8152554Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.8152786Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8152955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8153236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8153875Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8154503Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.8155135Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8155763Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8156467Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8157106Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8157735Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8158365Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8158991Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8159615Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8159745Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.8159839Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8159883Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8159921Z unimplemented [] 2025-12-04T09:58:55.8159982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8160082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8160674Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8160727Z graph_break [] 2025-12-04T09:58:55.8160800Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.8160841Z Autotune Choices Stats: 2025-12-04T09:58:55.8161580Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.8161706Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8161825Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8161987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8162600Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8163199Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8163802Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8164421Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8165055Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8165668Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8166317Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8166922Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8167522Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8168123Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8168272Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.8168315Z Autotune Choices Stats: 2025-12-04T09:58:55.8169112Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.8169330Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8169511Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8169789Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8170421Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8171059Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8171682Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8172310Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8172935Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8173602Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8174237Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8174866Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8175497Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8176168Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8176298Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.8176376Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8176419Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8176458Z unimplemented [] 2025-12-04T09:58:55.8176518Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8176617Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8177217Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8177254Z graph_break [] 2025-12-04T09:58:55.8177343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8177398Z Autotune Choices Stats: 
2025-12-04T09:58:55.8178143Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.8178283Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8178401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8178561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8179175Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8179787Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8180395Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8180998Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8181638Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8182252Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8182866Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8183471Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8184080Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8184687Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8184818Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.8184867Z Autotune Choices Stats: 2025-12-04T09:58:55.8185627Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.8185867Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8186119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8186397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8187054Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8187680Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8188305Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8188933Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8189569Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8190211Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8190856Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8191507Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8192139Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8192767Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8192898Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.8192979Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8193022Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8193062Z unimplemented [] 2025-12-04T09:58:55.8193124Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8193229Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8193804Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8193856Z graph_break [] 2025-12-04T09:58:55.8193934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8193975Z Autotune Choices Stats: 2025-12-04T09:58:55.8194735Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.8194875Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8194995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8195154Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8195782Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8196425Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8197028Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8197637Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8198249Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8198890Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8199494Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8200122Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8200725Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8201332Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8201462Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.8201507Z Autotune Choices Stats: 2025-12-04T09:58:55.8202271Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.8202490Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.8202668Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8202956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8203598Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8204237Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8204866Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8205492Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8206162Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8206797Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8207463Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8208090Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8208731Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8209363Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8209493Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.8209571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8209617Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8209656Z unimplemented [] 2025-12-04T09:58:55.8209718Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8209821Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8210394Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8210433Z graph_break [] 2025-12-04T09:58:55.8210513Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8210557Z Autotune Choices Stats: 2025-12-04T09:58:55.8211304Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.8211448Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8211583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8211749Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8212368Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8212979Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8213590Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8214194Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8214798Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8215402Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8216092Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8216691Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8217307Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8217916Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8218051Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.8218093Z Autotune Choices Stats: 2025-12-04T09:58:55.8218859Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.8219077Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8219245Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8219525Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8220196Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8220823Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8221461Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8222087Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8222723Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8223351Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8223976Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8224638Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8225268Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8225919Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8226085Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.8226159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8226204Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8226242Z unimplemented [] 2025-12-04T09:58:55.8226305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8226405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8226986Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8227027Z graph_break [] 2025-12-04T09:58:55.8227106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8227147Z Autotune Choices Stats: 2025-12-04T09:58:55.8227890Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.8228020Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8228151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8228312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8228944Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8229568Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8230172Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8230776Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8231384Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8231992Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8232600Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8233244Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8233860Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8234465Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8234598Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.8234639Z Autotune Choices Stats: 2025-12-04T09:58:55.8235401Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.8235620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8235789Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8236108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8236740Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8237406Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8238030Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8238676Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8239298Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8239920Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8240548Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8241177Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8241830Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8242473Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8242607Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.8242683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8242728Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8242765Z unimplemented [] 2025-12-04T09:58:55.8242830Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8242930Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8243500Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8243543Z graph_break [] 2025-12-04T09:58:55.8243620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8243666Z Autotune Choices Stats: 2025-12-04T09:58:55.8244408Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.8244539Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8244654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8244817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8245430Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8246112Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.8246729Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8247340Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8247948Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8248554Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8249164Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8249771Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8250417Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8251033Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8251166Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.8251209Z Autotune Choices Stats: 2025-12-04T09:58:55.8251979Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.8252198Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8252363Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8252643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8253280Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8253913Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8254571Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8255206Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8255842Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8256512Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8257141Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8257770Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8258403Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8259073Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8259223Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.8259298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8259342Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8259380Z unimplemented [] 2025-12-04T09:58:55.8259445Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8259547Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8260125Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8260168Z graph_break [] 2025-12-04T09:58:55.8260242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8260287Z Autotune Choices Stats: 2025-12-04T09:58:55.8261021Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.8261152Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8261268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8261427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8262042Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8262664Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8263291Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8263915Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8264522Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8265131Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8265738Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8266388Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8266996Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8267639Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8267784Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.8267827Z Autotune Choices Stats: 2025-12-04T09:58:55.8268594Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.8268813Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8268982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8269270Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8269905Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8270533Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8271159Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8271819Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8272462Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8273090Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8273726Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8274357Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8274985Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8275608Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8275762Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.8275853Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8275895Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8275959Z unimplemented [] 2025-12-04T09:58:55.8276020Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8276140Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8276712Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8276755Z graph_break [] 2025-12-04T09:58:55.8276832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8276876Z Autotune Choices Stats: 2025-12-04T09:58:55.8277610Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.8277740Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8277859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8278019Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8278642Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8279248Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8279883Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8280511Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8281129Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8281730Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8282337Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8282947Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8283555Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8284170Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8284325Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.8284370Z Autotune Choices Stats: 2025-12-04T09:58:55.8285135Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.8285368Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8285538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8285815Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8286478Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8287107Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8287737Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8288384Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8289038Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8289684Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8290305Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8290936Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8291567Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8292197Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8292338Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.8292418Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8292461Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8292501Z unimplemented [] 2025-12-04T09:58:55.8292562Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8292663Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8293256Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8293311Z graph_break [] 2025-12-04T09:58:55.8293389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8293433Z Autotune Choices Stats: 2025-12-04T09:58:55.8294181Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.8294308Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8296186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8296348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8296962Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8297567Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8298187Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8298858Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8299465Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8300080Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8300684Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8301369Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8301978Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8302590Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8302736Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.8302782Z Autotune Choices Stats: 2025-12-04T09:58:55.8303567Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.8303788Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8303957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8304237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8304879Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8305528Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8306182Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8306817Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8307486Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8308130Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8308759Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8309392Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8310051Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8310683Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8310815Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.8310896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8310938Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8310980Z unimplemented [] 2025-12-04T09:58:55.8311046Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8311164Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8311759Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8311804Z graph_break [] 2025-12-04T09:58:55.8311890Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8311936Z Autotune Choices Stats: 2025-12-04T09:58:55.8312680Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.8312813Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8312936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8313098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8313725Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8314351Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8314959Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8315692Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8316373Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8316977Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8317587Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8318201Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8318827Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8319435Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8319570Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.8319616Z Autotune Choices Stats: 2025-12-04T09:58:55.8320385Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8320648Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8320821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8321101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.8321745Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8322378Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8323023Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8323658Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8324297Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8324975Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8325604Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8326281Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8326912Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8327568Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8327702Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.8327785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8327831Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8327875Z unimplemented [] 2025-12-04T09:58:55.8327937Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8328045Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8328620Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8328677Z graph_break [] 2025-12-04T09:58:55.8328759Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8328801Z Autotune Choices Stats: 2025-12-04T09:58:55.8329582Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.8329712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8329835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8329995Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8330614Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8331233Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8331859Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8332469Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8333076Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8333721Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8334337Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8334946Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8335557Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8336222Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8336355Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.8336403Z Autotune Choices Stats: 2025-12-04T09:58:55.8337167Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.8337409Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8337580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8337887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8338529Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8339166Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8339796Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8340445Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8341083Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8341720Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8342389Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8343021Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8343666Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8344299Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8344449Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.8344526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8344575Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8344619Z unimplemented [] 2025-12-04T09:58:55.8344686Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8344790Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8345379Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8345420Z graph_break [] 2025-12-04T09:58:55.8345502Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8345545Z Autotune Choices Stats: 2025-12-04T09:58:55.8346330Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.8346525Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8346643Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8346811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8347431Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8348047Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8348678Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8349287Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8349903Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8350514Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8351153Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8351755Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8352369Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8352980Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8353128Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.8353172Z Autotune Choices Stats: 2025-12-04T09:58:55.8353941Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.8354161Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8354335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8354631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8355295Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8355965Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8356601Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8357234Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8357882Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8358518Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8359155Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8359837Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8360465Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8361100Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8361235Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.8361324Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8361374Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8361413Z unimplemented [] 2025-12-04T09:58:55.8361482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8361585Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8362164Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8362204Z graph_break [] 2025-12-04T09:58:55.8362284Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8362326Z Autotune Choices Stats: 2025-12-04T09:58:55.8363078Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.8363237Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8363354Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8363520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8364161Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8364771Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8365454Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8366124Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8366735Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8367346Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8367960Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8368612Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8369217Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8369829Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8369992Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.8370035Z Autotune Choices Stats: 2025-12-04T09:58:55.8370807Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.8371026Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8371193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8371477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8372113Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8372781Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8373410Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8374047Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8374680Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8375325Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8375992Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8376627Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8377311Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8377945Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8378082Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.8378160Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.8378213Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8378253Z unimplemented [] 2025-12-04T09:58:55.8378322Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8378424Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8379023Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8379068Z graph_break [] 2025-12-04T09:58:55.8379145Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8379191Z Autotune Choices Stats: 2025-12-04T09:58:55.8379939Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.8380078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8380195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8380361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8380982Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8381608Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8382222Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8382836Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8383457Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8384068Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8384677Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8385297Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8385962Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8386571Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8386706Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.8386748Z Autotune Choices Stats: 2025-12-04T09:58:55.8387513Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.8387756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8387924Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8388211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8388856Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8389483Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8390151Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8390787Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8391420Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8392063Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8392695Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8393338Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8393985Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8394640Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8394778Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.8394861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8394906Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.8394951Z unimplemented [] 2025-12-04T09:58:55.8395016Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8395125Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8395701Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8395756Z graph_break [] 2025-12-04T09:58:55.8395832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8395879Z Autotune Choices Stats: 2025-12-04T09:58:55.8396654Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.8396789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8396910Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8397074Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8397703Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8398356Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8398963Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8399573Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8400189Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8400810Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8401418Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8402037Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8402653Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8403278Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8403415Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.8403461Z Autotune Choices Stats: 2025-12-04T09:58:55.8404234Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.8404458Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8404635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8404917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8405555Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8406236Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8406880Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8407537Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8408174Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8408808Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8409447Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8410081Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8410715Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8411378Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8411520Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.8411597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8411642Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8411686Z unimplemented [] 
2025-12-04T09:58:55.8411751Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8411855Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8412434Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8412478Z graph_break [] 2025-12-04T09:58:55.8412555Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8412599Z Autotune Choices Stats: 2025-12-04T09:58:55.8413355Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.8413494Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8413610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8413769Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8414387Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8414992Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8415633Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8416287Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8416890Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8417496Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8418117Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8418721Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8419328Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8419973Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8420102Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.8420145Z Autotune Choices Stats: 2025-12-04T09:58:55.8420907Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.8421126Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8421292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8421566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8422211Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8422836Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8423460Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8424115Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8424757Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8425386Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8426038Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8426690Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8427314Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8427939Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8428087Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.8428165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8428207Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8428248Z unimplemented [] 2025-12-04T09:58:55.8428310Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.8428438Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8429009Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8429050Z graph_break [] 2025-12-04T09:58:55.8429129Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8429169Z Autotune Choices Stats: 2025-12-04T09:58:55.8429914Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.8430041Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8430171Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8430330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8430946Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8431554Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8432160Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8432794Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8433398Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8434002Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8434604Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8435219Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8435826Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8436457Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8436602Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.8436645Z Autotune Choices Stats: 
2025-12-04T09:58:55.8437432Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.8437651Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8437820Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8438098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8438728Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8439372Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8440001Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8443969Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8444638Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8445265Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8445893Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8446565Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8447219Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8447847Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8447979Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.8448062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8448106Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8448163Z unimplemented [] 2025-12-04T09:58:55.8448226Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8448330Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8448939Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8448977Z graph_break [] 2025-12-04T09:58:55.8449057Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8449097Z Autotune Choices Stats: 2025-12-04T09:58:55.8449842Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.8449973Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8450091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8450252Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.8450865Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8451478Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8452084Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8452691Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8453320Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8453922Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8454526Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8455126Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8455739Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8456373Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8456502Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.8456544Z Autotune Choices Stats: 2025-12-04T09:58:55.8457305Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.8457568Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8457734Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8458013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8458643Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8459269Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8459907Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8460529Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8461162Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8461820Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8462441Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8463066Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8463692Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8464326Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8464457Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.8464533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8464579Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8464616Z unimplemented [] 2025-12-04T09:58:55.8464682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8464784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8465355Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8465405Z graph_break [] 2025-12-04T09:58:55.8465480Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8465519Z Autotune Choices Stats: 2025-12-04T09:58:55.8466361Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.8466492Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8466606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8466769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8467377Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8467996Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8468596Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8469200Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8469800Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8470437Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8471039Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.8471644Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8472250Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8472863Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8472993Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.8473034Z Autotune Choices Stats: 2025-12-04T09:58:55.8473790Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.8474018Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8474195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8474488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8475120Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8475741Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8476400Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8477046Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8477674Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8478299Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8478981Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8479610Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8480233Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8480870Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8480999Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.8481073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8481119Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8481156Z unimplemented [] 2025-12-04T09:58:55.8481219Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8481318Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.8481890Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8481927Z graph_break [] 2025-12-04T09:58:55.8482001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8482041Z Autotune Choices Stats: 2025-12-04T09:58:55.8482805Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.8482945Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8483061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8483226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8483836Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8484443Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8485060Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8485664Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8486299Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8486898Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8487540Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8488146Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8488754Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8489372Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8489500Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.8489541Z Autotune Choices Stats: 2025-12-04T09:58:55.8490302Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.8490519Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8490683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8490972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8491620Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.8492248Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8492868Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8493501Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8494124Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8494754Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8495372Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8496092Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8496721Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8497351Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8497497Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.8497571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8497615Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8497651Z unimplemented [] 2025-12-04T09:58:55.8497714Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8497814Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8498394Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8498433Z graph_break [] 2025-12-04T09:58:55.8498506Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8498546Z Autotune Choices Stats: 2025-12-04T09:58:55.8499290Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.8499433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8499547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8499719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8500342Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8500943Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8501544Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8502159Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8502760Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8503365Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8503970Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8504590Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8505193Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8505797Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8505990Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.8506030Z Autotune Choices Stats: 2025-12-04T09:58:55.8506790Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.8507007Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8507172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8507451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8508099Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8508760Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8509382Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8510014Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8510669Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8511293Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8511915Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8512558Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8513205Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8513827Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8513956Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.8514030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8514073Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8514110Z unimplemented [] 2025-12-04T09:58:55.8514171Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8514283Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8514855Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8514892Z graph_break [] 2025-12-04T09:58:55.8514965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8515007Z Autotune Choices Stats: 2025-12-04T09:58:55.8515742Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2030", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.8515870Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8516018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8516192Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8516836Z triton_flex_attention_2030 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8517436Z triton_flex_attention_2031 0.0108 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8518044Z triton_flex_attention_2026 0.0112 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8518643Z triton_flex_attention_2028 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8519262Z triton_flex_attention_2029 0.0116 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8519866Z triton_flex_attention_2046 0.0132 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8520466Z triton_flex_attention_2027 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8521091Z triton_flex_attention_2038 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8521705Z triton_flex_attention_2044 0.0144 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8522302Z triton_flex_attention_2024 0.0147 ms 63.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8522432Z SingleProcess AUTOTUNE benchmarking takes 0.1936 seconds and 0.4021 seconds precompiling for 24 choices 2025-12-04T09:58:55.8522473Z Autotune Choices Stats: 2025-12-04T09:58:55.8523225Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2065", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8523457Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8523622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8523902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8524535Z triton_flex_attention_backward_2065 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8525168Z triton_flex_attention_backward_2059 0.0182 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8525825Z triton_flex_attention_backward_2056 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8526480Z triton_flex_attention_backward_2057 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8527105Z triton_flex_attention_backward_2066 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8527751Z triton_flex_attention_backward_2067 0.0200 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8528373Z triton_flex_attention_backward_2064 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8528996Z triton_flex_attention_backward_2069 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8529659Z triton_flex_attention_backward_2060 0.0224 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8530291Z triton_flex_attention_backward_2051 0.0230 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8530420Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.8209 seconds precompiling for 22 choices 2025-12-04T09:58:55.8530495Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8530537Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8530575Z unimplemented [] 2025-12-04T09:58:55.8530637Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8530739Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8531311Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8531360Z graph_break [] 2025-12-04T09:58:55.8531434Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8531476Z Autotune Choices Stats: 2025-12-04T09:58:55.8532213Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2077", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010320000350475311, "best_triton_pos": 0} 2025-12-04T09:58:55.8532341Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8532455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8532617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8533234Z triton_flex_attention_2077 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8533860Z triton_flex_attention_2074 0.0118 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8534470Z triton_flex_attention_2076 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8535079Z triton_flex_attention_2073 0.0130 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8535681Z triton_flex_attention_2084 0.0136 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8536328Z triton_flex_attention_2092 0.0139 ms 74.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8536933Z triton_flex_attention_2090 0.0144 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8537542Z triton_flex_attention_2082 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8538180Z triton_flex_attention_2075 0.0154 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8538780Z triton_flex_attention_2088 0.0165 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8538908Z SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 0.3908 seconds precompiling for 24 choices 2025-12-04T09:58:55.8538950Z Autotune Choices Stats: 2025-12-04T09:58:55.8539711Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2111", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.8539939Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8540105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8540379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8541008Z triton_flex_attention_backward_2111 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8541639Z triton_flex_attention_backward_2105 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8542293Z triton_flex_attention_backward_2110 0.0181 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8542914Z triton_flex_attention_backward_2102 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8543545Z triton_flex_attention_backward_2103 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8544168Z triton_flex_attention_backward_2113 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8544803Z triton_flex_attention_backward_2112 0.0204 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8545435Z triton_flex_attention_backward_2115 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8546101Z triton_flex_attention_backward_2097 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8546762Z triton_flex_attention_backward_2106 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8546890Z SingleProcess AUTOTUNE benchmarking takes 0.4709 seconds and 0.7187 seconds precompiling for 22 choices 2025-12-04T09:58:55.8546965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8547008Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8547047Z unimplemented [] 2025-12-04T09:58:55.8547108Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8547209Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8547786Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8547824Z graph_break [] 2025-12-04T09:58:55.8547899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8547939Z Autotune Choices Stats: 2025-12-04T09:58:55.8548675Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2122", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008960000239312649, "best_triton_pos": 0} 2025-12-04T09:58:55.8548814Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8548928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8549088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8549701Z triton_flex_attention_2122 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8550306Z triton_flex_attention_2123 0.0100 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8550933Z triton_flex_attention_2119 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8551535Z triton_flex_attention_2121 0.0133 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.8552148Z triton_flex_attention_2138 0.0134 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8552754Z triton_flex_attention_2130 0.0139 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8553365Z triton_flex_attention_2120 0.0142 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8553972Z triton_flex_attention_2136 0.0145 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8554578Z triton_flex_attention_2128 0.0149 ms 60.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8555210Z triton_flex_attention_2134 0.0166 ms 53.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8555338Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.4797 seconds precompiling for 24 choices 2025-12-04T09:58:55.8555379Z Autotune Choices Stats: 2025-12-04T09:58:55.8556175Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2157", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.8556390Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8556556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8556847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8557481Z triton_flex_attention_backward_2157 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8558105Z triton_flex_attention_backward_2151 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8558736Z triton_flex_attention_backward_2149 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8559412Z triton_flex_attention_backward_2148 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8560037Z triton_flex_attention_backward_2159 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8560662Z triton_flex_attention_backward_2158 0.0203 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8561291Z triton_flex_attention_backward_2156 0.0216 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8561927Z triton_flex_attention_backward_2161 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8562551Z 
triton_flex_attention_backward_2152 0.0228 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8563176Z triton_flex_attention_backward_2143 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8563317Z SingleProcess AUTOTUNE benchmarking takes 0.2555 seconds and 0.9394 seconds precompiling for 22 choices 2025-12-04T09:58:55.8563391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8563435Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8563484Z unimplemented [] 2025-12-04T09:58:55.8563555Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8563654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8564227Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8564264Z graph_break [] 2025-12-04T09:58:55.8564338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8564378Z Autotune Choices Stats: 2025-12-04T09:58:55.8565117Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2168", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009200000204145908, "best_triton_pos": 0} 2025-12-04T09:58:55.8565258Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8565372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8565535Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8566205Z triton_flex_attention_2168 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8566809Z triton_flex_attention_2166 0.0101 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8567412Z triton_flex_attention_2169 0.0104 ms 88.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8568055Z triton_flex_attention_2167 0.0113 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8568659Z triton_flex_attention_2184 0.0132 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8569262Z triton_flex_attention_2165 0.0133 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8569869Z triton_flex_attention_2176 0.0135 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8570489Z triton_flex_attention_2182 0.0140 ms 65.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8571093Z triton_flex_attention_2174 0.0150 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8571695Z triton_flex_attention_2180 0.0164 ms 56.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8571836Z SingleProcess AUTOTUNE benchmarking takes 0.2350 seconds and 0.4301 seconds precompiling for 24 choices 2025-12-04T09:58:55.8571878Z Autotune Choices Stats: 2025-12-04T09:58:55.8572654Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2203", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8572871Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8573040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8573318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8573951Z triton_flex_attention_backward_2203 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8574600Z triton_flex_attention_backward_2197 0.0181 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8575222Z triton_flex_attention_backward_2195 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8575847Z triton_flex_attention_backward_2194 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8576543Z triton_flex_attention_backward_2205 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8577176Z triton_flex_attention_backward_2204 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8577800Z triton_flex_attention_backward_2202 0.0217 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8578431Z triton_flex_attention_backward_2207 0.0219 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8579086Z 
triton_flex_attention_backward_2198 0.0227 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8579709Z triton_flex_attention_backward_2189 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8579840Z SingleProcess AUTOTUNE benchmarking takes 0.2634 seconds and 0.7312 seconds precompiling for 22 choices 2025-12-04T09:58:55.8579914Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8579972Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8580010Z unimplemented [] 2025-12-04T09:58:55.8580074Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8580173Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8580766Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8580804Z graph_break [] 2025-12-04T09:58:55.8580880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8580923Z Autotune Choices Stats: 2025-12-04T09:58:55.8581665Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2212", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009920000098645687, "best_triton_pos": 0} 2025-12-04T09:58:55.8581795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8581907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8582070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8582689Z triton_flex_attention_2212 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8583294Z triton_flex_attention_2214 0.0108 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8583898Z triton_flex_attention_2213 0.0111 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8584503Z triton_flex_attention_2230 0.0128 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8585133Z triton_flex_attention_2211 0.0128 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8585746Z triton_flex_attention_2222 0.0133 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8586401Z triton_flex_attention_2215 0.0134 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8587020Z triton_flex_attention_2228 0.0143 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8587617Z triton_flex_attention_2220 0.0147 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8588227Z triton_flex_attention_2226 0.0164 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8588357Z SingleProcess AUTOTUNE benchmarking takes 0.2288 seconds and 0.3817 seconds precompiling for 24 choices 2025-12-04T09:58:55.8588411Z Autotune Choices Stats: 2025-12-04T09:58:55.8589201Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2249", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015759000554680824, "best_triton_pos": 0} 2025-12-04T09:58:55.8589419Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8589581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8589859Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8590495Z triton_flex_attention_backward_2249 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8591120Z triton_flex_attention_backward_2243 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8591755Z triton_flex_attention_backward_2241 0.0186 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8592382Z triton_flex_attention_backward_2240 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8593008Z triton_flex_attention_backward_2251 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8593668Z triton_flex_attention_backward_2250 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8594294Z triton_flex_attention_backward_2253 0.0218 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8594925Z triton_flex_attention_backward_2248 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8595564Z 
triton_flex_attention_backward_2244 0.0224 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8596223Z triton_flex_attention_backward_2235 0.0229 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8596353Z SingleProcess AUTOTUNE benchmarking takes 0.2552 seconds and 0.7055 seconds precompiling for 22 choices 2025-12-04T09:58:55.8596446Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T09:58:55.8596498Z Traceback (most recent call last): 2025-12-04T09:58:55.8596654Z File "/var/lib/jenkins/pytorch/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T09:58:55.8596695Z self.assertTrue( 2025-12-04T09:58:55.8596801Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T09:58:55.8596853Z raise self.failureException(msg) 2025-12-04T09:58:55.8597003Z AssertionError: False is not true : Log file /tmp/tmpp3c9nvxc/flex_attention_configs.json was not created 2025-12-04T09:58:55.8597007Z 
2025-12-04T09:58:55.8597087Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.8597251Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.8597254Z 2025-12-04T09:58:55.8597353Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.8597442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8597498Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8597538Z unimplemented [] 2025-12-04T09:58:55.8597599Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8598178Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('select_algorithm_num_precompiles', 46), ('async_compile_cache_miss', 43), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2), ('async_compile_cache_hit', 1)] 2025-12-04T09:58:55.8598279Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8598316Z graph_break [] 2025-12-04T09:58:55.8598388Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8598879Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T09:58:55.8598929Z current_size = base.storage().size() 2025-12-04T09:58:55.8598971Z Autotune Choices Stats: 2025-12-04T09:58:55.8599726Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_6", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.012000000104308128, "best_triton_pos": 0} 2025-12-04T09:58:55.8599854Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8599969Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8600131Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8600743Z triton_flex_attention_6 0.0120 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8601347Z 
triton_flex_attention_22 0.0131 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8601978Z triton_flex_attention_14 0.0136 ms 88.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8602580Z triton_flex_attention_7 0.0141 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8603179Z triton_flex_attention_20 0.0142 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8603785Z triton_flex_attention_12 0.0150 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8604393Z triton_flex_attention_18 0.0164 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8604993Z triton_flex_attention_10 0.0168 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8605593Z 
triton_flex_attention_13 0.0181 ms 66.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8606277Z triton_flex_attention_21 0.0183 ms 65.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8606406Z SingleProcess AUTOTUNE benchmarking takes 0.1994 seconds and 0.6176 seconds precompiling for 24 choices 2025-12-04T09:58:55.8606450Z Autotune Choices Stats: 2025-12-04T09:58:55.8607212Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_41", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, 
"best_triton_pos": 0} 2025-12-04T09:58:55.8607431Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8607599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8607892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8608524Z triton_flex_attention_backward_41 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8609150Z triton_flex_attention_backward_35 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8609772Z triton_flex_attention_backward_32 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8610429Z triton_flex_attention_backward_33 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8611053Z triton_flex_attention_backward_43 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8611679Z triton_flex_attention_backward_42 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8612299Z triton_flex_attention_backward_40 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8612938Z triton_flex_attention_backward_45 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8613563Z triton_flex_attention_backward_36 0.0229 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8614181Z triton_flex_attention_backward_27 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8614322Z SingleProcess AUTOTUNE benchmarking takes 0.2838 seconds and 0.8000 seconds precompiling for 22 choices 2025-12-04T09:58:55.8614399Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8614453Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8614503Z unimplemented [] 2025-12-04T09:58:55.8614566Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8614667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8615239Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8615278Z graph_break [] 2025-12-04T09:58:55.8615355Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T09:58:55.8615396Z Autotune Choices Stats: 2025-12-04T09:58:55.8616167Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_50", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010199000127613544, "best_triton_pos": 0} 2025-12-04T09:58:55.8616311Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8616427Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8616589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8617203Z triton_flex_attention_50 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8617804Z triton_flex_attention_53 0.0106 ms 95.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8618397Z triton_flex_attention_51 0.0113 ms 90.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8619036Z triton_flex_attention_52 0.0120 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8619641Z triton_flex_attention_68 0.0132 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8620244Z triton_flex_attention_49 0.0137 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8620848Z triton_flex_attention_60 0.0139 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8621464Z triton_flex_attention_66 0.0141 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8622068Z triton_flex_attention_58 0.0147 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8622666Z triton_flex_attention_64 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8622814Z SingleProcess AUTOTUNE benchmarking takes 0.2404 seconds and 0.3300 seconds precompiling for 24 choices 2025-12-04T09:58:55.8622856Z Autotune Choices Stats: 2025-12-04T09:58:55.8623641Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_87", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.8623859Z AUTOTUNE flex_attention_backward(1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8624025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8624299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8624923Z triton_flex_attention_backward_87 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8625557Z triton_flex_attention_backward_81 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8626215Z triton_flex_attention_backward_79 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8626830Z triton_flex_attention_backward_78 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8627495Z triton_flex_attention_backward_89 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8628124Z triton_flex_attention_backward_88 0.0205 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8628749Z triton_flex_attention_backward_86 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8629371Z triton_flex_attention_backward_91 0.0221 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8630005Z triton_flex_attention_backward_73 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8630631Z triton_flex_attention_backward_82 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8630758Z SingleProcess AUTOTUNE benchmarking takes 0.5360 seconds and 0.7033 seconds precompiling for 22 choices 2025-12-04T09:58:55.8630834Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8630889Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8630928Z unimplemented [] 2025-12-04T09:58:55.8630990Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8631092Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8631695Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8631735Z graph_break [] 2025-12-04T09:58:55.8631809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8631851Z Autotune Choices Stats: 2025-12-04T09:58:55.8632587Z {"num_choices": 
24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010400000028312206, "best_triton_pos": 0} 2025-12-04T09:58:55.8632713Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8632827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8632989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8633611Z triton_flex_attention_99 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8634212Z triton_flex_attention_98 0.0106 ms 97.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8634818Z triton_flex_attention_97 0.0112 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8635417Z triton_flex_attention_96 0.0126 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8636106Z triton_flex_attention_114 0.0131 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.8636702Z triton_flex_attention_106 0.0137 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8637310Z triton_flex_attention_112 0.0142 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8637934Z triton_flex_attention_104 0.0149 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8638528Z triton_flex_attention_95 0.0162 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8639139Z triton_flex_attention_110 0.0164 ms 63.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8639269Z SingleProcess AUTOTUNE benchmarking takes 0.2558 seconds and 0.4810 seconds precompiling for 24 choices 2025-12-04T09:58:55.8639325Z Autotune Choices Stats: 2025-12-04T09:58:55.8640108Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_133", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.8640325Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.8640488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8640770Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8641409Z triton_flex_attention_backward_133 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8642038Z triton_flex_attention_backward_127 0.0183 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8642673Z triton_flex_attention_backward_124 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8643295Z triton_flex_attention_backward_125 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8643920Z triton_flex_attention_backward_134 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8644576Z triton_flex_attention_backward_135 0.0202 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8645197Z triton_flex_attention_backward_132 0.0219 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8645828Z triton_flex_attention_backward_137 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8646502Z triton_flex_attention_backward_128 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8647120Z triton_flex_attention_backward_119 0.0230 ms 67.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8647249Z SingleProcess AUTOTUNE benchmarking takes 0.5158 seconds and 0.6793 seconds precompiling for 22 choices 2025-12-04T09:58:55.8647323Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8647367Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8647406Z unimplemented [] 2025-12-04T09:58:55.8647470Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8647570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8648147Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8648199Z graph_break [] 2025-12-04T09:58:55.8648276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8648315Z Autotune Choices Stats: 2025-12-04T09:58:55.8649082Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_144", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009119999594986439, "best_triton_pos": 0} 2025-12-04T09:58:55.8649212Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8649327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8649489Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8650101Z triton_flex_attention_144 0.0091 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8650717Z triton_flex_attention_142 0.0110 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8651317Z triton_flex_attention_145 0.0112 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8651919Z triton_flex_attention_143 0.0117 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8652536Z triton_flex_attention_160 0.0130 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8653164Z triton_flex_attention_152 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8653763Z triton_flex_attention_141 0.0134 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8654366Z triton_flex_attention_158 0.0140 ms 65.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8654979Z triton_flex_attention_150 0.0150 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8655573Z triton_flex_attention_156 0.0164 ms 55.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8655702Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.4350 seconds precompiling for 24 choices 2025-12-04T09:58:55.8655742Z Autotune Choices Stats: 2025-12-04T09:58:55.8656528Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_179", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.8656759Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8656950Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8657228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8657862Z triton_flex_attention_backward_179 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8658490Z triton_flex_attention_backward_173 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8659131Z triton_flex_attention_backward_171 0.0186 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8659753Z triton_flex_attention_backward_170 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8660379Z triton_flex_attention_backward_181 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8661004Z triton_flex_attention_backward_180 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8661661Z triton_flex_attention_backward_178 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8662299Z triton_flex_attention_backward_183 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8662923Z triton_flex_attention_backward_174 0.0227 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8663557Z triton_flex_attention_backward_165 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8663687Z SingleProcess AUTOTUNE benchmarking takes 0.2509 seconds and 0.7118 seconds precompiling for 22 choices 2025-12-04T09:58:55.8663762Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8663808Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8663845Z unimplemented [] 2025-12-04T09:58:55.8663911Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8664009Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8664579Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8664616Z graph_break [] 2025-12-04T09:58:55.8664691Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8664742Z Autotune Choices Stats: 2025-12-04T09:58:55.8665503Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_190", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.8665631Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8665745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8665908Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8666565Z triton_flex_attention_190 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8667165Z triton_flex_attention_191 0.0105 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.8667787Z triton_flex_attention_188 0.0116 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8668394Z triton_flex_attention_189 0.0117 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8668994Z triton_flex_attention_187 0.0128 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8669619Z triton_flex_attention_198 0.0134 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8670260Z triton_flex_attention_206 0.0135 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8670866Z triton_flex_attention_204 0.0140 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8671462Z triton_flex_attention_196 0.0148 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8672075Z triton_flex_attention_202 0.0164 ms 56.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8672208Z SingleProcess AUTOTUNE benchmarking takes 0.2491 seconds and 0.3418 seconds precompiling for 24 choices 2025-12-04T09:58:55.8672249Z Autotune Choices Stats: 2025-12-04T09:58:55.8673016Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_225", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.8673233Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8673396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8673686Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8674331Z triton_flex_attention_backward_225 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8674951Z triton_flex_attention_backward_219 0.0182 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8675581Z triton_flex_attention_backward_216 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8676247Z triton_flex_attention_backward_217 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8676873Z triton_flex_attention_backward_227 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8677498Z triton_flex_attention_backward_226 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8678138Z triton_flex_attention_backward_229 0.0218 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8678787Z triton_flex_attention_backward_224 0.0219 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8679422Z triton_flex_attention_backward_220 0.0227 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8680043Z triton_flex_attention_backward_211 0.0230 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8680187Z SingleProcess AUTOTUNE benchmarking takes 0.2391 seconds and 0.8642 seconds precompiling for 22 choices 2025-12-04T09:58:55.8680260Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8680303Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8680341Z unimplemented [] 2025-12-04T09:58:55.8680404Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8680505Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8681079Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8681117Z graph_break [] 2025-12-04T09:58:55.8681193Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8681236Z Autotune Choices Stats: 2025-12-04T09:58:55.8681975Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_234", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.8682112Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8682225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8682409Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8683109Z triton_flex_attention_234 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8683719Z triton_flex_attention_236 0.0101 ms 95.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8684323Z triton_flex_attention_237 0.0108 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8684940Z triton_flex_attention_252 0.0131 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8685539Z triton_flex_attention_244 0.0136 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8686176Z triton_flex_attention_250 0.0140 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8686820Z triton_flex_attention_235 0.0141 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8687435Z triton_flex_attention_242 0.0149 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8688043Z triton_flex_attention_248 0.0165 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8688642Z triton_flex_attention_232 0.0167 ms 57.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8688788Z SingleProcess AUTOTUNE benchmarking takes 0.2295 seconds and 0.4517 seconds precompiling for 24 choices 2025-12-04T09:58:55.8688828Z Autotune Choices Stats: 2025-12-04T09:58:55.8689580Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_271", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.8689797Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8689961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8690242Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8690884Z triton_flex_attention_backward_271 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8691526Z triton_flex_attention_backward_265 0.0185 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8692149Z triton_flex_attention_backward_262 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8692773Z triton_flex_attention_backward_263 0.0187 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8693409Z triton_flex_attention_backward_272 0.0202 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8694030Z triton_flex_attention_backward_273 0.0203 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8694658Z triton_flex_attention_backward_270 0.0218 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8695304Z triton_flex_attention_backward_275 0.0221 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8695974Z triton_flex_attention_backward_266 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8696605Z triton_flex_attention_backward_257 0.0229 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8696734Z SingleProcess AUTOTUNE benchmarking takes 0.2529 seconds and 0.8286 seconds precompiling for 22 choices 2025-12-04T09:58:55.8696812Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8696855Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8696894Z unimplemented [] 2025-12-04T09:58:55.8696970Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8697071Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8697640Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8697678Z graph_break [] 2025-12-04T09:58:55.8697751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8697793Z Autotune Choices Stats: 2025-12-04T09:58:55.8698529Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_281", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011719999834895134, "best_triton_pos": 0} 2025-12-04T09:58:55.8698657Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8698770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8698944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8699581Z triton_flex_attention_281 0.0117 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8700186Z triton_flex_attention_282 0.0126 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8700789Z triton_flex_attention_280 0.0129 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8701391Z triton_flex_attention_279 0.0130 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8702008Z triton_flex_attention_283 0.0131 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8702613Z triton_flex_attention_298 0.0134 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8703217Z triton_flex_attention_290 0.0136 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8703852Z triton_flex_attention_296 0.0143 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8704446Z triton_flex_attention_288 0.0149 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8705049Z triton_flex_attention_294 0.0166 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8705179Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4440 seconds precompiling for 24 choices 2025-12-04T09:58:55.8705221Z Autotune Choices Stats: 2025-12-04T09:58:55.8706030Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_317", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.8706262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8706427Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8706702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8707336Z triton_flex_attention_backward_317 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8707987Z triton_flex_attention_backward_311 0.0182 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8708622Z triton_flex_attention_backward_308 0.0188 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8709246Z triton_flex_attention_backward_309 0.0188 ms 82.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8709869Z triton_flex_attention_backward_318 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8710512Z triton_flex_attention_backward_319 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8711137Z triton_flex_attention_backward_316 0.0217 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8711768Z triton_flex_attention_backward_321 0.0221 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8712428Z triton_flex_attention_backward_312 0.0229 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8713053Z triton_flex_attention_backward_303 0.0230 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8713184Z SingleProcess AUTOTUNE benchmarking takes 0.2339 seconds and 0.7129 seconds precompiling for 22 choices 2025-12-04T09:58:55.8713261Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8713304Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8713347Z unimplemented [] 2025-12-04T09:58:55.8713409Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8713510Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8714077Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8714130Z graph_break [] 2025-12-04T09:58:55.8714204Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8714244Z Autotune Choices Stats: 2025-12-04T09:58:55.8714983Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8", "best_time": 0.010440000332891941, "best_triton_pos": 0} 2025-12-04T09:58:55.8715109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8715223Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8715382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8716103Z triton_flex_attention_329 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8716753Z triton_flex_attention_328 0.0120 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8717360Z triton_flex_attention_327 0.0123 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8717970Z triton_flex_attention_344 0.0131 ms 79.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8718574Z triton_flex_attention_336 0.0135 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8719184Z triton_flex_attention_326 0.0137 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8719784Z triton_flex_attention_325 0.0138 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8720387Z triton_flex_attention_342 0.0143 ms 73.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8721016Z triton_flex_attention_334 0.0149 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8721618Z triton_flex_attention_340 0.0164 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8721748Z SingleProcess AUTOTUNE benchmarking takes 0.2420 seconds and 0.4332 seconds precompiling for 24 choices 2025-12-04T09:58:55.8721789Z Autotune Choices Stats: 2025-12-04T09:58:55.8722550Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_363", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.8722781Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8722946Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8723224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T09:58:55.8723862Z triton_flex_attention_backward_363 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8724489Z triton_flex_attention_backward_357 0.0185 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8725138Z triton_flex_attention_backward_354 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8725761Z triton_flex_attention_backward_355 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8726429Z triton_flex_attention_backward_365 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8727054Z triton_flex_attention_backward_364 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8727695Z triton_flex_attention_backward_362 0.0220 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8728323Z triton_flex_attention_backward_367 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8728955Z triton_flex_attention_backward_358 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8729615Z triton_flex_attention_backward_349 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8729744Z SingleProcess AUTOTUNE benchmarking takes 0.2430 seconds and 0.7358 seconds precompiling for 22 choices 2025-12-04T09:58:55.8729822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8729865Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8729904Z unimplemented [] 2025-12-04T09:58:55.8729965Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8730065Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8730637Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8730674Z graph_break [] 2025-12-04T09:58:55.8730748Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8730788Z Autotune Choices Stats: 2025-12-04T09:58:55.8731528Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 
2025-12-04T09:58:55.8731663Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8731778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8731938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8732553Z triton_flex_attention_375 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8733156Z triton_flex_attention_373 0.0114 ms 90.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8733794Z triton_flex_attention_374 0.0121 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8734393Z triton_flex_attention_372 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8735005Z triton_flex_attention_390 0.0132 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8735606Z triton_flex_attention_382 0.0138 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8736254Z triton_flex_attention_388 0.0140 ms 73.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8736855Z triton_flex_attention_380 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8737457Z triton_flex_attention_386 0.0164 ms 62.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8738099Z triton_flex_attention_378 0.0168 ms 61.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8738226Z SingleProcess AUTOTUNE benchmarking takes 0.2284 seconds and 0.4256 seconds precompiling for 24 choices 2025-12-04T09:58:55.8738268Z Autotune Choices Stats: 2025-12-04T09:58:55.8739029Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_409", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.8739246Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8739411Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8739699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8740330Z triton_flex_attention_backward_409 0.0158 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8740958Z triton_flex_attention_backward_403 0.0183 ms 86.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8741586Z triton_flex_attention_backward_400 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8742231Z triton_flex_attention_backward_401 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8742855Z triton_flex_attention_backward_411 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8743484Z triton_flex_attention_backward_410 0.0202 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8744111Z triton_flex_attention_backward_413 0.0218 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8744743Z triton_flex_attention_backward_408 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8745370Z triton_flex_attention_backward_404 0.0226 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8746038Z triton_flex_attention_backward_395 0.0233 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8746182Z SingleProcess AUTOTUNE benchmarking takes 0.2510 seconds and 0.7879 seconds precompiling for 22 
choices 2025-12-04T09:58:55.8746259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8746303Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8746341Z unimplemented [] 2025-12-04T09:58:55.8746426Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8746530Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8747103Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8747142Z graph_break [] 2025-12-04T09:58:55.8747217Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8747257Z Autotune Choices Stats: 2025-12-04T09:58:55.8748000Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_420", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009278999641537666, "best_triton_pos": 0} 2025-12-04T09:58:55.8748141Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1) 2025-12-04T09:58:55.8748253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8748417Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8749026Z triton_flex_attention_420 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8749630Z triton_flex_attention_418 0.0101 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8750230Z triton_flex_attention_419 0.0115 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8750868Z triton_flex_attention_421 0.0124 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8751465Z triton_flex_attention_417 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8752069Z triton_flex_attention_436 0.0133 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8752674Z triton_flex_attention_428 0.0136 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8753284Z triton_flex_attention_426 0.0146 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8753887Z triton_flex_attention_434 0.0150 ms 62.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8754487Z triton_flex_attention_432 0.0162 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8754627Z SingleProcess AUTOTUNE benchmarking takes 0.2357 seconds and 0.4621 seconds precompiling for 24 choices 2025-12-04T09:58:55.8754667Z Autotune Choices Stats: 2025-12-04T09:58:55.8755451Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_455", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015558999963104725, "best_triton_pos": 0} 2025-12-04T09:58:55.8755666Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8755831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8756137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8756767Z triton_flex_attention_backward_455 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8757407Z triton_flex_attention_backward_449 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8758030Z triton_flex_attention_backward_446 0.0186 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8758653Z triton_flex_attention_backward_447 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8759320Z triton_flex_attention_backward_457 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8759946Z triton_flex_attention_backward_456 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8760566Z triton_flex_attention_backward_454 0.0218 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.8761192Z triton_flex_attention_backward_459 0.0220 ms 70.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8761828Z triton_flex_attention_backward_441 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8762457Z triton_flex_attention_backward_450 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8762585Z SingleProcess AUTOTUNE benchmarking takes 0.2614 seconds and 0.6939 seconds precompiling for 22 choices 2025-12-04T09:58:55.8762659Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.8762719Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8762755Z unimplemented [] 2025-12-04T09:58:55.8762817Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8762916Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8763510Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8763547Z graph_break [] 2025-12-04T09:58:55.8763629Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8763670Z Autotune Choices Stats: 2025-12-04T09:58:55.8764403Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_466", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008438999764621258, "best_triton_pos": 0} 2025-12-04T09:58:55.8764531Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8764644Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8764806Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8765430Z triton_flex_attention_466 0.0084 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8766064Z triton_flex_attention_467 0.0106 ms 79.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8766667Z triton_flex_attention_465 0.0112 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8767263Z triton_flex_attention_462 0.0114 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8767908Z triton_flex_attention_464 0.0117 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8768513Z triton_flex_attention_463 0.0130 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8769118Z triton_flex_attention_482 0.0134 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8769718Z triton_flex_attention_474 0.0137 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8770337Z triton_flex_attention_480 0.0143 ms 58.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8770938Z triton_flex_attention_472 0.0148 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8771068Z SingleProcess AUTOTUNE benchmarking takes 0.2280 seconds and 0.3515 seconds precompiling for 24 choices 2025-12-04T09:58:55.8771108Z Autotune Choices Stats: 2025-12-04T09:58:55.8771897Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015399999916553497, "best_triton_pos": 0} 2025-12-04T09:58:55.8772116Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8772284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8772568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8773199Z triton_flex_attention_backward_501 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8773824Z triton_flex_attention_backward_495 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8774460Z triton_flex_attention_backward_492 0.0187 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8775089Z triton_flex_attention_backward_493 0.0190 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8775716Z triton_flex_attention_backward_503 0.0200 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8776420Z triton_flex_attention_backward_502 0.0202 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8777047Z triton_flex_attention_backward_500 0.0216 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8777673Z triton_flex_attention_backward_505 0.0219 ms 70.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8778297Z triton_flex_attention_backward_496 0.0227 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8778935Z triton_flex_attention_backward_487 0.0228 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8779066Z SingleProcess AUTOTUNE benchmarking takes 0.2618 seconds and 0.8038 seconds precompiling for 22 choices 2025-12-04T09:58:55.8779139Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8779183Z frames [('total', 2), 
('ok', 2)] 2025-12-04T09:58:55.8779220Z unimplemented [] 2025-12-04T09:58:55.8779284Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8779385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8779963Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8780017Z graph_break [] 2025-12-04T09:58:55.8780091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8780132Z Autotune Choices Stats: 2025-12-04T09:58:55.8780903Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_512", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009479000233113766, "best_triton_pos": 0} 2025-12-04T09:58:55.8781032Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8781145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8781307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8781921Z triton_flex_attention_512 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8782536Z triton_flex_attention_510 0.0097 ms 97.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8783141Z triton_flex_attention_513 0.0110 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8783754Z triton_flex_attention_511 0.0120 ms 79.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8784351Z triton_flex_attention_509 0.0129 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8784991Z triton_flex_attention_528 0.0131 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8785596Z triton_flex_attention_520 0.0136 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8786229Z triton_flex_attention_526 0.0143 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8786866Z triton_flex_attention_518 0.0150 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8787469Z triton_flex_attention_524 0.0164 ms 57.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8787605Z SingleProcess AUTOTUNE 
benchmarking takes 0.2396 seconds and 0.4217 seconds precompiling for 24 choices 2025-12-04T09:58:55.8787650Z Autotune Choices Stats: 2025-12-04T09:58:55.8788413Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_547", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015879999846220016, "best_triton_pos": 0} 2025-12-04T09:58:55.8788654Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8788857Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8789141Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8789772Z triton_flex_attention_backward_547 0.0159 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8790403Z triton_flex_attention_backward_541 0.0184 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8791028Z triton_flex_attention_backward_538 0.0188 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8791667Z triton_flex_attention_backward_539 0.0188 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8792297Z triton_flex_attention_backward_549 0.0202 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8792927Z triton_flex_attention_backward_548 0.0203 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8793581Z triton_flex_attention_backward_546 0.0218 ms 73.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8794206Z triton_flex_attention_backward_551 0.0221 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8794836Z triton_flex_attention_backward_542 0.0227 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8795474Z triton_flex_attention_backward_533 0.0232 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8795607Z SingleProcess AUTOTUNE benchmarking takes 0.2602 seconds and 0.9028 seconds precompiling for 22 choices 2025-12-04T09:58:55.8795685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8795729Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8795766Z unimplemented [] 
2025-12-04T09:58:55.8795829Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8795963Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8796539Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8796577Z graph_break [] 2025-12-04T09:58:55.8796650Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8796712Z Autotune Choices Stats: 2025-12-04T09:58:55.8797489Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_556", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010160000063478947, "best_triton_pos": 0} 2025-12-04T09:58:55.8797618Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8797735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8797899Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8798510Z triton_flex_attention_556 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8799112Z triton_flex_attention_559 0.0105 ms 96.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8799734Z triton_flex_attention_557 0.0117 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8800343Z triton_flex_attention_558 0.0120 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8800944Z triton_flex_attention_555 0.0130 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8801557Z triton_flex_attention_574 0.0131 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8802181Z triton_flex_attention_566 0.0140 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8802786Z triton_flex_attention_572 0.0143 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8803388Z triton_flex_attention_564 0.0152 ms 66.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8804005Z triton_flex_attention_570 0.0163 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8804136Z SingleProcess AUTOTUNE benchmarking takes 0.2442 seconds and 0.5472 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.8804175Z Autotune Choices Stats: 2025-12-04T09:58:55.8804931Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.8805152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8805315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8805605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8806310Z triton_flex_attention_backward_593 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8806931Z triton_flex_attention_backward_587 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8807553Z triton_flex_attention_backward_584 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8808196Z triton_flex_attention_backward_585 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8808824Z 
triton_flex_attention_backward_595 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8809447Z triton_flex_attention_backward_594 0.0201 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8810068Z triton_flex_attention_backward_592 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8810739Z triton_flex_attention_backward_597 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8811364Z triton_flex_attention_backward_588 0.0226 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8811983Z triton_flex_attention_backward_579 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8812122Z SingleProcess AUTOTUNE benchmarking takes 0.2676 seconds and 0.8099 seconds precompiling for 22 choices 2025-12-04T09:58:55.8812198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8812240Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8812278Z unimplemented [] 2025-12-04T09:58:55.8812339Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.8812440Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8813014Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8813053Z graph_break [] 2025-12-04T09:58:55.8813127Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8813167Z Autotune Choices Stats: 2025-12-04T09:58:55.8813906Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.01092000026255846, "best_triton_pos": 0} 2025-12-04T09:58:55.8814053Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8814166Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8814336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8814967Z triton_flex_attention_605 0.0109 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8815570Z triton_flex_attention_603 0.0117 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8816209Z triton_flex_attention_604 0.0122 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8816830Z triton_flex_attention_602 0.0132 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8817436Z triton_flex_attention_620 0.0134 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8818039Z triton_flex_attention_612 0.0136 ms 80.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8818647Z triton_flex_attention_601 0.0138 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8819280Z triton_flex_attention_618 0.0141 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8819885Z triton_flex_attention_610 0.0149 ms 73.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8820479Z triton_flex_attention_616 0.0163 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8820621Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.4054 seconds precompiling for 24 choices 2025-12-04T09:58:55.8820663Z Autotune Choices Stats: 
2025-12-04T09:58:55.8821417Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_639", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015438999980688095, "best_triton_pos": 0} 2025-12-04T09:58:55.8821639Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8821807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8822082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8822714Z triton_flex_attention_backward_639 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8823375Z triton_flex_attention_backward_633 0.0181 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8823999Z triton_flex_attention_backward_631 0.0186 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8824624Z triton_flex_attention_backward_630 0.0187 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8825268Z triton_flex_attention_backward_641 0.0201 ms 76.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8825889Z triton_flex_attention_backward_640 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8826547Z triton_flex_attention_backward_638 0.0217 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8827191Z triton_flex_attention_backward_643 0.0220 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8827840Z triton_flex_attention_backward_634 0.0227 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8828468Z triton_flex_attention_backward_625 0.0228 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8828596Z SingleProcess AUTOTUNE benchmarking takes 0.2568 seconds and 0.8500 seconds precompiling for 22 choices 2025-12-04T09:58:55.8828670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8828714Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8828751Z unimplemented [] 2025-12-04T09:58:55.8828813Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8828934Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8829506Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8829545Z graph_break [] 2025-12-04T09:58:55.8829618Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8829660Z Autotune Choices Stats: 2025-12-04T09:58:55.8830398Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_648", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009879999794065952, "best_triton_pos": 0} 2025-12-04T09:58:55.8830523Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8830637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8830814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.8831436Z triton_flex_attention_648 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8832032Z triton_flex_attention_649 0.0116 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8832639Z triton_flex_attention_651 0.0121 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8833239Z triton_flex_attention_650 0.0128 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8833854Z triton_flex_attention_666 0.0132 ms 74.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8834452Z triton_flex_attention_647 0.0135 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8835058Z triton_flex_attention_658 0.0138 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8835683Z triton_flex_attention_664 0.0143 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8836336Z triton_flex_attention_656 0.0149 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8836943Z triton_flex_attention_662 0.0164 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8837072Z SingleProcess AUTOTUNE benchmarking takes 0.2582 seconds and 0.4752 seconds precompiling for 24 choices 2025-12-04T09:58:55.8837115Z Autotune Choices Stats: 2025-12-04T09:58:55.8837872Z {"num_choices": 22, "num_triton_choices": 
22, "best_kernel": "triton_flex_attention_backward_685", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8838104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8838267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8838546Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8839191Z triton_flex_attention_backward_685 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.8839827Z triton_flex_attention_backward_679 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8840472Z triton_flex_attention_backward_677 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8841096Z triton_flex_attention_backward_676 0.0188 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8841726Z triton_flex_attention_backward_687 0.0201 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8842366Z triton_flex_attention_backward_686 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8842980Z triton_flex_attention_backward_684 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8843607Z triton_flex_attention_backward_689 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8844252Z triton_flex_attention_backward_680 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8844886Z triton_flex_attention_backward_671 0.0231 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8845015Z SingleProcess AUTOTUNE benchmarking takes 0.2670 seconds and 0.8704 seconds precompiling for 22 choices 2025-12-04T09:58:55.8845093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8845135Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8845174Z unimplemented [] 2025-12-04T09:58:55.8845235Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8845335Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8845912Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8846005Z graph_break [] 2025-12-04T09:58:55.8846081Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8846122Z Autotune Choices Stats: 2025-12-04T09:58:55.8846867Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.8846993Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8847109Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8847267Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8847874Z triton_flex_attention_697 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8848515Z triton_flex_attention_694 0.0107 ms 94.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8849117Z triton_flex_attention_696 0.0110 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8849720Z triton_flex_attention_695 0.0117 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8850321Z triton_flex_attention_693 0.0130 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8850945Z triton_flex_attention_712 0.0132 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8851547Z triton_flex_attention_704 0.0136 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8852151Z triton_flex_attention_710 0.0140 ms 71.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8852792Z triton_flex_attention_702 0.0147 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8853395Z triton_flex_attention_708 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8853524Z SingleProcess AUTOTUNE benchmarking takes 0.2451 seconds and 0.5257 seconds precompiling for 24 choices 2025-12-04T09:58:55.8853567Z Autotune Choices Stats: 2025-12-04T09:58:55.8854325Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_731", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.8854554Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8854721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8854997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8855629Z triton_flex_attention_backward_731 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8856297Z 
triton_flex_attention_backward_725 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8856944Z triton_flex_attention_backward_723 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8857577Z triton_flex_attention_backward_722 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8858205Z triton_flex_attention_backward_733 0.0202 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8858829Z triton_flex_attention_backward_732 0.0203 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8859473Z triton_flex_attention_backward_730 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8860098Z triton_flex_attention_backward_735 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8860725Z triton_flex_attention_backward_726 0.0228 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8864362Z triton_flex_attention_backward_717 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8864506Z SingleProcess AUTOTUNE benchmarking takes 0.2731 seconds and 0.7158 seconds precompiling for 22 choices 2025-12-04T09:58:55.8864579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8864626Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8864663Z unimplemented [] 2025-12-04T09:58:55.8864724Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8864825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8865396Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 72), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 26), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 10), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8865435Z graph_break [] 2025-12-04T09:58:55.8865510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8865551Z Autotune Choices Stats: 2025-12-04T09:58:55.8866332Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010319000110030174, "best_triton_pos": 0} 2025-12-04T09:58:55.8866476Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8866590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8866756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8867369Z triton_flex_attention_743 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8867968Z triton_flex_attention_740 0.0104 ms 98.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8868606Z triton_flex_attention_741 0.0117 ms 88.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8869203Z triton_flex_attention_742 0.0120 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8869806Z triton_flex_attention_750 0.0135 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8870405Z triton_flex_attention_758 0.0137 ms 75.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8871020Z triton_flex_attention_756 0.0143 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8871623Z triton_flex_attention_748 0.0150 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8872225Z triton_flex_attention_754 0.0164 ms 63.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8872852Z triton_flex_attention_739 0.0164 ms 62.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8872981Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.4263 seconds precompiling for 24 choices 2025-12-04T09:58:55.8873022Z Autotune Choices Stats: 2025-12-04T09:58:55.8873786Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_777", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.8874004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8874172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8874464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8875092Z triton_flex_attention_backward_777 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8875721Z triton_flex_attention_backward_771 0.0182 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8876384Z triton_flex_attention_backward_768 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8877046Z triton_flex_attention_backward_769 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8877672Z triton_flex_attention_backward_779 0.0199 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8878301Z triton_flex_attention_backward_778 0.0200 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8878926Z triton_flex_attention_backward_781 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8879561Z triton_flex_attention_backward_776 0.0218 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8880190Z triton_flex_attention_backward_772 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8880817Z triton_flex_attention_backward_763 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8880966Z SingleProcess AUTOTUNE benchmarking takes 0.2236 seconds and 0.6720 seconds precompiling for 22 choices 2025-12-04T09:58:55.8881039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8881083Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8881122Z unimplemented [] 2025-12-04T09:58:55.8881195Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8881309Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8881882Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 27), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 11), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8881920Z graph_break [] 2025-12-04T09:58:55.8881997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8882038Z Autotune Choices Stats: 2025-12-04T09:58:55.8882777Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011358999647200108, "best_triton_pos": 0} 2025-12-04T09:58:55.8882918Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8883030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8883193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8883807Z triton_flex_attention_789 0.0114 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8884414Z triton_flex_attention_787 0.0125 ms 91.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8885021Z triton_flex_attention_785 0.0127 ms 89.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8885651Z triton_flex_attention_788 0.0130 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8886294Z triton_flex_attention_786 0.0132 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8886901Z triton_flex_attention_796 0.0133 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8887499Z triton_flex_attention_804 0.0139 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8888117Z triton_flex_attention_802 0.0145 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8888717Z triton_flex_attention_794 0.0150 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8889324Z triton_flex_attention_800 0.0162 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8889471Z SingleProcess AUTOTUNE benchmarking takes 0.2567 seconds and 0.4614 seconds precompiling for 24 choices 2025-12-04T09:58:55.8889510Z Autotune Choices Stats: 2025-12-04T09:58:55.8890291Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_823", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8890509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8890674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8890959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8891591Z triton_flex_attention_backward_823 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8892224Z triton_flex_attention_backward_817 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8892855Z triton_flex_attention_backward_815 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8893481Z triton_flex_attention_backward_814 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8894130Z triton_flex_attention_backward_825 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8894753Z triton_flex_attention_backward_824 0.0204 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8895376Z triton_flex_attention_backward_822 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8896034Z triton_flex_attention_backward_827 0.0220 ms 70.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.8896676Z triton_flex_attention_backward_809 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8897303Z triton_flex_attention_backward_818 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8897437Z SingleProcess AUTOTUNE benchmarking takes 0.3762 seconds and 0.8858 seconds precompiling for 22 choices 2025-12-04T09:58:55.8897510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8897568Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8897607Z unimplemented [] 2025-12-04T09:58:55.8897669Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8897772Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8898368Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8898407Z graph_break [] 2025-12-04T09:58:55.8898480Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8898522Z Autotune Choices Stats: 2025-12-04T09:58:55.8899257Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_834", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.8899384Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8899496Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8899659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8900289Z triton_flex_attention_834 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8900888Z triton_flex_attention_832 0.0102 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8901494Z triton_flex_attention_835 0.0106 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8902094Z triton_flex_attention_833 0.0115 ms 76.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=8 2025-12-04T09:58:55.8902729Z triton_flex_attention_850 0.0132 ms 66.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8903328Z triton_flex_attention_842 0.0137 ms 63.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8903929Z triton_flex_attention_831 0.0140 ms 62.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8904542Z triton_flex_attention_848 0.0144 ms 60.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8905156Z triton_flex_attention_840 0.0149 ms 58.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8905760Z triton_flex_attention_846 0.0165 ms 52.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8905890Z SingleProcess AUTOTUNE benchmarking takes 0.2264 seconds and 0.3728 seconds precompiling for 24 choices 2025-12-04T09:58:55.8905962Z Autotune Choices Stats: 2025-12-04T09:58:55.8906745Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_869", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.8906985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8907151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8907429Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8908061Z triton_flex_attention_backward_869 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8908681Z triton_flex_attention_backward_863 0.0184 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8909321Z triton_flex_attention_backward_861 0.0189 ms 82.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8909943Z triton_flex_attention_backward_860 0.0190 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8910570Z triton_flex_attention_backward_871 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8911242Z triton_flex_attention_backward_870 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8911866Z triton_flex_attention_backward_868 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8912495Z triton_flex_attention_backward_873 0.0221 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8913120Z triton_flex_attention_backward_864 0.0228 ms 68.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8913751Z triton_flex_attention_backward_855 0.0230 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8913880Z SingleProcess AUTOTUNE benchmarking takes 0.2653 seconds and 0.9077 seconds precompiling for 22 choices 2025-12-04T09:58:55.8913954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8913997Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8914033Z unimplemented [] 2025-12-04T09:58:55.8914097Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8914198Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8914771Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8914821Z graph_break [] 2025-12-04T09:58:55.8914893Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8914934Z Autotune Choices Stats: 2025-12-04T09:58:55.8915699Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.009680000133812428, "best_triton_pos": 0} 2025-12-04T09:58:55.8915827Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8915979Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8916141Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8916747Z triton_flex_attention_881 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8917359Z triton_flex_attention_878 0.0104 ms 93.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8917971Z triton_flex_attention_880 0.0112 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8918573Z triton_flex_attention_879 0.0113 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8919172Z triton_flex_attention_877 0.0130 ms 74.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8919813Z triton_flex_attention_896 0.0131 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8920420Z triton_flex_attention_888 0.0135 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8921026Z triton_flex_attention_894 0.0141 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8921629Z triton_flex_attention_886 0.0147 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8922249Z triton_flex_attention_892 0.0163 ms 59.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8922380Z SingleProcess AUTOTUNE benchmarking takes 0.2411 seconds and 0.4500 seconds precompiling for 24 choices 2025-12-04T09:58:55.8922421Z Autotune Choices Stats: 2025-12-04T09:58:55.8923172Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_915", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015838999301195145, "best_triton_pos": 0} 2025-12-04T09:58:55.8923400Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8923575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8923866Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8924505Z triton_flex_attention_backward_915 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8925128Z triton_flex_attention_backward_909 0.0183 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8925753Z triton_flex_attention_backward_907 0.0186 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8926416Z triton_flex_attention_backward_906 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8927046Z triton_flex_attention_backward_917 0.0201 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8927668Z triton_flex_attention_backward_916 0.0204 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8928330Z triton_flex_attention_backward_914 0.0220 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8928959Z triton_flex_attention_backward_919 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8929586Z triton_flex_attention_backward_910 0.0228 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8930218Z triton_flex_attention_backward_901 0.0230 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8930346Z SingleProcess AUTOTUNE benchmarking takes 0.2649 seconds and 0.6858 seconds precompiling for 22 choices 2025-12-04T09:58:55.8930422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8930464Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8930505Z unimplemented [] 2025-12-04T09:58:55.8930568Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8930669Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8931247Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8931291Z graph_break [] 2025-12-04T09:58:55.8931363Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8931404Z Autotune Choices Stats: 2025-12-04T09:58:55.8932162Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_926", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.8932302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8932416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8932578Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8933196Z triton_flex_attention_926 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8933801Z triton_flex_attention_925 0.0118 ms 88.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8934409Z triton_flex_attention_942 0.0132 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8935010Z triton_flex_attention_923 0.0132 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8935623Z triton_flex_attention_927 0.0134 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8936261Z triton_flex_attention_924 0.0134 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8936896Z triton_flex_attention_934 0.0136 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8937493Z triton_flex_attention_940 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8938100Z triton_flex_attention_932 0.0148 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8938716Z triton_flex_attention_938 0.0163 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8938844Z SingleProcess AUTOTUNE benchmarking takes 0.2462 seconds and 0.4391 seconds precompiling for 24 choices 2025-12-04T09:58:55.8938887Z Autotune Choices Stats: 2025-12-04T09:58:55.8939635Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_961", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.8939854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8940020Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8940305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8940954Z triton_flex_attention_backward_961 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8941576Z triton_flex_attention_backward_955 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8942200Z triton_flex_attention_backward_952 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8942824Z triton_flex_attention_backward_953 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8943461Z triton_flex_attention_backward_963 0.0198 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8944091Z 
triton_flex_attention_backward_962 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8944718Z triton_flex_attention_backward_965 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8945375Z triton_flex_attention_backward_960 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8946035Z triton_flex_attention_backward_956 0.0225 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8946662Z triton_flex_attention_backward_947 0.0232 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8946806Z SingleProcess AUTOTUNE benchmarking takes 0.2257 seconds and 0.8452 seconds precompiling for 22 choices 2025-12-04T09:58:55.8946883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8946926Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8946966Z unimplemented [] 2025-12-04T09:58:55.8947029Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8947133Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8947699Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8947738Z graph_break [] 2025-12-04T09:58:55.8947811Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8947855Z Autotune Choices Stats: 2025-12-04T09:58:55.8948597Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_972", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00887999963015318, "best_triton_pos": 0} 2025-12-04T09:58:55.8948736Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8948854Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8949026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8949649Z triton_flex_attention_972 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8952794Z triton_flex_attention_970 0.0100 ms 88.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8953401Z triton_flex_attention_971 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8954022Z triton_flex_attention_973 0.0123 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8954620Z triton_flex_attention_969 0.0131 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8955217Z triton_flex_attention_980 0.0136 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8955829Z triton_flex_attention_988 0.0136 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8956512Z triton_flex_attention_986 0.0140 ms 63.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8957107Z triton_flex_attention_978 0.0150 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8957707Z triton_flex_attention_984 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8957851Z SingleProcess AUTOTUNE benchmarking takes 0.2423 seconds and 0.4183 seconds precompiling for 24 choices 2025-12-04T09:58:55.8957894Z Autotune Choices Stats: 2025-12-04T09:58:55.8958649Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1007", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01568000018596649, "best_triton_pos": 0} 2025-12-04T09:58:55.8958869Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8959038Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8959313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8959946Z triton_flex_attention_backward_1007 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8960606Z triton_flex_attention_backward_1001 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8961229Z triton_flex_attention_backward_999 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8961851Z triton_flex_attention_backward_998 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8962490Z triton_flex_attention_backward_1008 0.0202 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8963122Z triton_flex_attention_backward_1009 0.0203 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8963745Z triton_flex_attention_backward_1006 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8964382Z triton_flex_attention_backward_1011 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8965035Z triton_flex_attention_backward_1002 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8965657Z triton_flex_attention_backward_993 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8965787Z SingleProcess AUTOTUNE benchmarking takes 0.2732 seconds and 0.7139 seconds precompiling for 22 choices 2025-12-04T09:58:55.8965863Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8965905Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8965980Z unimplemented [] 2025-12-04T09:58:55.8966042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8966158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8966728Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8966765Z graph_break [] 
2025-12-04T09:58:55.8966844Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8966884Z Autotune Choices Stats: 2025-12-04T09:58:55.8967620Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1018", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009320000186562538, "best_triton_pos": 0} 2025-12-04T09:58:55.8967747Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8967864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8968036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8968680Z triton_flex_attention_1018 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8969290Z triton_flex_attention_1019 0.0113 ms 82.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8969892Z triton_flex_attention_1017 0.0116 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8970489Z triton_flex_attention_1015 0.0131 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8971101Z triton_flex_attention_1016 0.0132 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8971706Z triton_flex_attention_1026 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8972305Z triton_flex_attention_1034 0.0138 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8972923Z triton_flex_attention_1032 0.0144 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8973538Z 
triton_flex_attention_1024 0.0149 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8974143Z triton_flex_attention_1030 0.0165 ms 56.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8974271Z SingleProcess AUTOTUNE benchmarking takes 0.2485 seconds and 0.5090 seconds precompiling for 24 choices 2025-12-04T09:58:55.8974312Z Autotune Choices Stats: 2025-12-04T09:58:55.8975071Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1053", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 
0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.8975299Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8975464Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8975745Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8976412Z triton_flex_attention_backward_1053 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8977048Z triton_flex_attention_backward_1047 0.0180 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.8977698Z triton_flex_attention_backward_1044 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8978325Z triton_flex_attention_backward_1045 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8978951Z triton_flex_attention_backward_1054 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8979593Z triton_flex_attention_backward_1055 0.0203 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8980219Z triton_flex_attention_backward_1052 0.0218 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8980848Z triton_flex_attention_backward_1057 0.0221 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8981490Z triton_flex_attention_backward_1039 0.0228 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8982124Z triton_flex_attention_backward_1048 0.0229 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8982255Z SingleProcess AUTOTUNE benchmarking takes 0.2557 seconds and 0.8372 seconds precompiling for 22 choices 2025-12-04T09:58:55.8982330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8982373Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8982410Z unimplemented [] 2025-12-04T09:58:55.8982473Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8982573Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8983145Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.8983192Z graph_break [] 2025-12-04T09:58:55.8983267Z ----------------------------- Captured stderr 
call ----------------------------- 2025-12-04T09:58:55.8983306Z Autotune Choices Stats: 2025-12-04T09:58:55.8984045Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1062", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01023900043219328, "best_triton_pos": 0} 2025-12-04T09:58:55.8984175Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8984288Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8984453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8985064Z triton_flex_attention_1062 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8985706Z triton_flex_attention_1064 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8986438Z triton_flex_attention_1065 0.0104 ms 98.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8987041Z triton_flex_attention_1063 0.0113 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8987643Z triton_flex_attention_1080 0.0131 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8988256Z triton_flex_attention_1072 0.0136 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8988859Z triton_flex_attention_1061 0.0141 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8989469Z triton_flex_attention_1078 0.0142 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8990107Z triton_flex_attention_1070 0.0146 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, 
BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8990706Z triton_flex_attention_1076 0.0164 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8990838Z SingleProcess AUTOTUNE benchmarking takes 0.2443 seconds and 0.3731 seconds precompiling for 24 choices 2025-12-04T09:58:55.8990877Z Autotune Choices Stats: 2025-12-04T09:58:55.8991637Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1099", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.8991863Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.8992027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.8992306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.8992935Z triton_flex_attention_backward_1099 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8993563Z triton_flex_attention_backward_1093 0.0184 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8994204Z triton_flex_attention_backward_1090 0.0186 
ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8994838Z triton_flex_attention_backward_1091 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8995466Z triton_flex_attention_backward_1101 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8996137Z triton_flex_attention_backward_1100 0.0203 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8996774Z triton_flex_attention_backward_1098 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8997401Z triton_flex_attention_backward_1103 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.8998027Z triton_flex_attention_backward_1094 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8998687Z triton_flex_attention_backward_1085 0.0232 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.8998817Z SingleProcess AUTOTUNE benchmarking takes 0.2682 seconds and 0.7614 seconds precompiling for 22 choices 2025-12-04T09:58:55.8998891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.8998935Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.8998971Z unimplemented [] 2025-12-04T09:58:55.8999034Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.8999135Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.8999711Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 71), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 25), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 9), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.8999748Z graph_break [] 2025-12-04T09:58:55.8999823Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.8999864Z Autotune Choices Stats: 
2025-12-04T09:58:55.9000611Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1110", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00872000027447939, "best_triton_pos": 0} 2025-12-04T09:58:55.9000753Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9000865Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9001026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9001636Z triton_flex_attention_1110 0.0087 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9002236Z triton_flex_attention_1111 0.0107 ms 81.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9002877Z triton_flex_attention_1106 0.0114 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9003478Z triton_flex_attention_1109 0.0124 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9004084Z triton_flex_attention_1126 0.0132 ms 66.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9004685Z triton_flex_attention_1107 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9005296Z triton_flex_attention_1108 0.0132 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9005898Z triton_flex_attention_1118 0.0136 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9006533Z triton_flex_attention_1124 0.0144 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9007171Z triton_flex_attention_1116 0.0149 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9007300Z SingleProcess AUTOTUNE benchmarking takes 0.2221 seconds and 0.4859 seconds precompiling for 24 choices 2025-12-04T09:58:55.9007339Z Autotune Choices Stats: 2025-12-04T09:58:55.9008169Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1145", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015799999237060547, "best_triton_pos": 0} 2025-12-04T09:58:55.9008387Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9008550Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9008841Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9009475Z triton_flex_attention_backward_1145 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9010099Z triton_flex_attention_backward_1139 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9010720Z triton_flex_attention_backward_1136 0.0188 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9011376Z triton_flex_attention_backward_1137 0.0189 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9012001Z triton_flex_attention_backward_1147 0.0199 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9012627Z triton_flex_attention_backward_1146 0.0200 ms 79.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9013251Z triton_flex_attention_backward_1144 0.0219 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9013893Z triton_flex_attention_backward_1149 0.0220 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9014519Z triton_flex_attention_backward_1140 0.0225 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.9015142Z triton_flex_attention_backward_1131 0.0229 ms 69.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9015281Z SingleProcess AUTOTUNE benchmarking takes 0.2619 seconds and 0.8417 seconds precompiling for 22 choices 2025-12-04T09:58:55.9015356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9015400Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9015436Z unimplemented [] 2025-12-04T09:58:55.9015520Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9015620Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9016227Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9016266Z graph_break [] 2025-12-04T09:58:55.9016339Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9016380Z Autotune Choices Stats: 2025-12-04T09:58:55.9017112Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": 
"triton_flex_attention_1155", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.011118999682366848, "best_triton_pos": 0} 2025-12-04T09:58:55.9017252Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9017366Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9017528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9018140Z triton_flex_attention_1155 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9018744Z triton_flex_attention_1156 0.0120 ms 93.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9019344Z triton_flex_attention_1154 0.0127 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9019986Z triton_flex_attention_1172 0.0132 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9020591Z triton_flex_attention_1157 0.0132 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.9021193Z triton_flex_attention_1153 0.0133 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9021794Z triton_flex_attention_1164 0.0136 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9022408Z triton_flex_attention_1170 0.0139 ms 80.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9023016Z triton_flex_attention_1162 0.0148 ms 75.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9023616Z triton_flex_attention_1168 0.0166 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9023755Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.3740 seconds precompiling for 24 choices 2025-12-04T09:58:55.9023795Z Autotune Choices Stats: 2025-12-04T09:58:55.9024570Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1191", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.9024789Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T09:58:55.9024954Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9025233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9025960Z triton_flex_attention_backward_1191 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9026604Z triton_flex_attention_backward_1185 0.0182 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9027233Z triton_flex_attention_backward_1183 0.0188 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9027856Z triton_flex_attention_backward_1182 0.0188 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9028518Z triton_flex_attention_backward_1193 0.0202 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9029144Z triton_flex_attention_backward_1192 0.0203 ms 76.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9029770Z triton_flex_attention_backward_1190 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9030399Z triton_flex_attention_backward_1195 0.0220 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9031038Z triton_flex_attention_backward_1186 0.0227 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9031667Z 
triton_flex_attention_backward_1177 0.0229 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9031796Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.6788 seconds precompiling for 22 choices 2025-12-04T09:58:55.9031870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9031925Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9031963Z unimplemented [] 2025-12-04T09:58:55.9032025Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9032126Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9032719Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9032756Z graph_break [] 2025-12-04T09:58:55.9032829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9032870Z Autotune Choices Stats: 2025-12-04T09:58:55.9033609Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1200", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.9033739Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9033852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9034012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9034632Z triton_flex_attention_1200 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9035236Z triton_flex_attention_1202 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9035847Z triton_flex_attention_1218 0.0132 ms 76.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9036494Z triton_flex_attention_1210 0.0136 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9037132Z triton_flex_attention_1199 0.0138 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9037734Z triton_flex_attention_1203 0.0142 ms 70.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9038337Z triton_flex_attention_1216 0.0146 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9038936Z triton_flex_attention_1201 0.0150 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9039549Z triton_flex_attention_1208 0.0151 ms 66.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9040150Z triton_flex_attention_1214 0.0163 ms 61.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9040279Z SingleProcess AUTOTUNE benchmarking takes 0.2437 seconds and 0.5227 seconds precompiling for 24 choices 2025-12-04T09:58:55.9040321Z Autotune Choices Stats: 2025-12-04T09:58:55.9041114Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1237", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015440000221133232, "best_triton_pos": 0} 2025-12-04T09:58:55.9041331Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9041494Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9041771Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9042403Z triton_flex_attention_backward_1237 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9043028Z triton_flex_attention_backward_1231 0.0181 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9043662Z triton_flex_attention_backward_1228 0.0187 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9044289Z triton_flex_attention_backward_1229 0.0189 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9044921Z triton_flex_attention_backward_1239 0.0201 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9045579Z triton_flex_attention_backward_1238 0.0204 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9046237Z triton_flex_attention_backward_1236 0.0217 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9046866Z triton_flex_attention_backward_1241 0.0222 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9047493Z triton_flex_attention_backward_1232 0.0228 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9048132Z triton_flex_attention_backward_1223 0.0231 ms 66.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, 
BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9048259Z SingleProcess AUTOTUNE benchmarking takes 0.2673 seconds and 0.9084 seconds precompiling for 22 choices 2025-12-04T09:58:55.9048334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9048375Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9048413Z unimplemented [] 2025-12-04T09:58:55.9048476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9048575Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9049148Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9049209Z graph_break [] 2025-12-04T09:58:55.9049283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9049326Z Autotune Choices Stats: 2025-12-04T09:58:55.9050088Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1248", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.00875999964773655, "best_triton_pos": 0} 2025-12-04T09:58:55.9050214Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9050328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9050489Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9051104Z triton_flex_attention_1248 0.0088 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9051724Z triton_flex_attention_1249 0.0105 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9052323Z triton_flex_attention_1244 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9052927Z triton_flex_attention_1246 0.0110 ms 79.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9053542Z triton_flex_attention_1247 0.0117 ms 74.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9054176Z triton_flex_attention_1245 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9054780Z triton_flex_attention_1264 0.0131 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9055383Z triton_flex_attention_1256 0.0136 ms 64.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9056043Z triton_flex_attention_1262 0.0143 ms 61.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9056645Z triton_flex_attention_1254 0.0149 ms 58.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9056773Z SingleProcess AUTOTUNE benchmarking takes 0.2195 seconds and 0.4105 seconds precompiling for 24 choices 2025-12-04T09:58:55.9056814Z Autotune Choices Stats: 2025-12-04T09:58:55.9057581Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1283", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.9057813Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9058004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9058282Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9058919Z triton_flex_attention_backward_1283 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9059548Z triton_flex_attention_backward_1277 0.0183 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9060182Z triton_flex_attention_backward_1274 0.0186 ms 84.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9060804Z triton_flex_attention_backward_1275 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9061435Z triton_flex_attention_backward_1285 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9062062Z triton_flex_attention_backward_1284 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9062713Z 
triton_flex_attention_backward_1282 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9063340Z triton_flex_attention_backward_1287 0.0222 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9063965Z triton_flex_attention_backward_1278 0.0229 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9064606Z triton_flex_attention_backward_1269 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9064733Z SingleProcess AUTOTUNE benchmarking takes 0.2711 seconds and 0.8455 seconds precompiling for 22 choices 2025-12-04T09:58:55.9064807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9064850Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9064889Z unimplemented [] 2025-12-04T09:58:55.9064951Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9065051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9065621Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9065659Z graph_break [] 2025-12-04T09:58:55.9065733Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9065782Z Autotune Choices Stats: 2025-12-04T09:58:55.9066574Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010520000010728836, "best_triton_pos": 0} 2025-12-04T09:58:55.9066700Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9066814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9066976Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9067595Z triton_flex_attention_1295 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9068198Z triton_flex_attention_1292 0.0127 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.9068813Z triton_flex_attention_1291 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9069415Z triton_flex_attention_1294 0.0129 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9070016Z triton_flex_attention_1293 0.0131 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9070635Z triton_flex_attention_1310 0.0132 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9071254Z triton_flex_attention_1302 0.0137 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9071860Z triton_flex_attention_1308 0.0142 ms 73.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9072460Z triton_flex_attention_1300 0.0150 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9073071Z triton_flex_attention_1306 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9073199Z SingleProcess AUTOTUNE benchmarking takes 0.2490 seconds and 0.5807 seconds precompiling for 24 choices 2025-12-04T09:58:55.9073240Z Autotune Choices Stats: 2025-12-04T09:58:55.9074010Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.9074226Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9074390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9074677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9075332Z triton_flex_attention_backward_1329 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9076000Z triton_flex_attention_backward_1323 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9076622Z triton_flex_attention_backward_1320 0.0187 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9077264Z triton_flex_attention_backward_1321 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9077896Z triton_flex_attention_backward_1331 0.0198 ms 79.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9078531Z triton_flex_attention_backward_1330 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9079173Z triton_flex_attention_backward_1333 0.0217 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9079831Z triton_flex_attention_backward_1328 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9080466Z triton_flex_attention_backward_1324 0.0225 ms 69.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9081089Z triton_flex_attention_backward_1315 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9081243Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.8502 seconds precompiling for 22 choices 2025-12-04T09:58:55.9081318Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9081365Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9081403Z unimplemented [] 2025-12-04T09:58:55.9081464Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9081564Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9082141Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9082179Z graph_break [] 2025-12-04T09:58:55.9082254Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9082294Z Autotune Choices Stats: 2025-12-04T09:58:55.9083043Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1338", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.011800000444054604, "best_triton_pos": 0} 2025-12-04T09:58:55.9083185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9083303Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9083593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9084264Z triton_flex_attention_1338 0.0118 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9084963Z triton_flex_attention_1340 0.0118 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9085598Z triton_flex_attention_1339 0.0122 ms 97.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9086247Z triton_flex_attention_1337 0.0128 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9086945Z triton_flex_attention_1356 0.0131 ms 90.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9087582Z triton_flex_attention_1348 0.0136 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9088237Z triton_flex_attention_1354 0.0140 ms 84.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9088857Z triton_flex_attention_1341 0.0142 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9089458Z triton_flex_attention_1346 0.0150 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9090066Z 
triton_flex_attention_1352 0.0164 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9090220Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.3456 seconds precompiling for 24 choices 2025-12-04T09:58:55.9090264Z Autotune Choices Stats: 2025-12-04T09:58:55.9091024Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1375", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.9091244Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9091410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9091687Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9092329Z triton_flex_attention_backward_1375 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9092974Z triton_flex_attention_backward_1369 0.0182 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9093599Z triton_flex_attention_backward_1367 0.0186 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.9094223Z triton_flex_attention_backward_1366 0.0187 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9094863Z triton_flex_attention_backward_1377 0.0202 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9095489Z triton_flex_attention_backward_1376 0.0204 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9096172Z triton_flex_attention_backward_1374 0.0216 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9096849Z triton_flex_attention_backward_1379 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9097488Z triton_flex_attention_backward_1370 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9098114Z triton_flex_attention_backward_1361 0.0228 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9098242Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.9039 seconds precompiling for 22 choices 2025-12-04T09:58:55.9098321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9098365Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9098415Z unimplemented [] 2025-12-04T09:58:55.9098476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9098577Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9099151Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9099188Z graph_break [] 2025-12-04T09:58:55.9099262Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9099303Z Autotune Choices Stats: 2025-12-04T09:58:55.9100041Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1386", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.9100167Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9100281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9100455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9101093Z triton_flex_attention_1386 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9101696Z triton_flex_attention_1384 0.0100 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9102300Z triton_flex_attention_1387 0.0106 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9102900Z triton_flex_attention_1382 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9103519Z triton_flex_attention_1383 0.0129 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9104118Z triton_flex_attention_1385 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9104722Z triton_flex_attention_1402 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9105358Z triton_flex_attention_1400 0.0145 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9105981Z triton_flex_attention_1394 0.0149 ms 62.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9106585Z triton_flex_attention_1392 0.0150 ms 61.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9106712Z SingleProcess AUTOTUNE benchmarking takes 0.2334 seconds and 0.3596 seconds precompiling for 24 choices 2025-12-04T09:58:55.9106754Z Autotune Choices Stats: 2025-12-04T09:58:55.9107518Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1421", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.9107746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9107910Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9108187Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9108814Z triton_flex_attention_backward_1421 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9109475Z triton_flex_attention_backward_1415 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9110102Z triton_flex_attention_backward_1413 0.0187 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.9110726Z triton_flex_attention_backward_1412 0.0189 ms 82.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9111349Z triton_flex_attention_backward_1423 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9111994Z triton_flex_attention_backward_1422 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9112615Z triton_flex_attention_backward_1420 0.0218 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9113243Z triton_flex_attention_backward_1425 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9113904Z triton_flex_attention_backward_1407 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9114535Z triton_flex_attention_backward_1416 0.0227 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9114664Z SingleProcess AUTOTUNE benchmarking takes 0.2526 seconds and 0.7268 seconds precompiling for 22 choices 2025-12-04T09:58:55.9114739Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9114781Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9114820Z unimplemented [] 2025-12-04T09:58:55.9114881Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9114980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9115554Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9115601Z graph_break [] 2025-12-04T09:58:55.9115676Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9115715Z Autotune Choices Stats: 2025-12-04T09:58:55.9116530Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1432", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009560000151395798, "best_triton_pos": 0} 2025-12-04T09:58:55.9116659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9116772Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9116933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9117542Z triton_flex_attention_1432 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9118183Z triton_flex_attention_1430 0.0100 ms 95.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9118791Z triton_flex_attention_1433 0.0116 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9119393Z triton_flex_attention_1431 0.0122 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9119996Z triton_flex_attention_1448 0.0128 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9120610Z triton_flex_attention_1440 0.0136 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9121216Z triton_flex_attention_1446 0.0142 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9121816Z triton_flex_attention_1438 0.0147 ms 65.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9122445Z triton_flex_attention_1429 0.0163 ms 58.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9123047Z triton_flex_attention_1444 0.0165 ms 58.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9123178Z SingleProcess AUTOTUNE benchmarking takes 0.2307 seconds and 0.4499 seconds precompiling for 24 choices 2025-12-04T09:58:55.9123218Z Autotune Choices Stats: 2025-12-04T09:58:55.9123976Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1467", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.9124200Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9124366Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9124642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, 
torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9125280Z triton_flex_attention_backward_1467 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9125908Z triton_flex_attention_backward_1461 0.0182 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9126612Z triton_flex_attention_backward_1459 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9127235Z triton_flex_attention_backward_1458 0.0187 ms 84.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9127868Z triton_flex_attention_backward_1469 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9128495Z triton_flex_attention_backward_1468 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9129143Z triton_flex_attention_backward_1466 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9129770Z triton_flex_attention_backward_1471 0.0221 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9130397Z triton_flex_attention_backward_1462 0.0229 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9131056Z triton_flex_attention_backward_1453 0.0230 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9131184Z SingleProcess AUTOTUNE benchmarking takes 0.2787 seconds and 0.9129 seconds precompiling for 22 choices 2025-12-04T09:58:55.9131258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9131301Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9131339Z unimplemented [] 2025-12-04T09:58:55.9131402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9131501Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9132079Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 70), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 24), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 8), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9132116Z graph_break [] 2025-12-04T09:58:55.9132191Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9132231Z Autotune Choices Stats: 2025-12-04T09:58:55.9132972Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1478", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009518999606370926, "best_triton_pos": 0} 2025-12-04T09:58:55.9133108Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9133221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9133384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9134003Z triton_flex_attention_1478 0.0095 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9134610Z triton_flex_attention_1479 0.0104 ms 91.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9135243Z triton_flex_attention_1474 0.0115 ms 82.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9135844Z triton_flex_attention_1477 0.0120 ms 79.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9136484Z triton_flex_attention_1476 0.0121 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9137086Z triton_flex_attention_1475 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9137703Z triton_flex_attention_1494 0.0133 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9138306Z triton_flex_attention_1486 0.0136 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9138913Z triton_flex_attention_1492 0.0144 ms 65.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9139550Z triton_flex_attention_1484 0.0149 ms 63.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9139679Z SingleProcess AUTOTUNE benchmarking takes 0.2165 seconds and 0.4348 seconds precompiling for 24 choices 2025-12-04T09:58:55.9139720Z Autotune Choices Stats: 2025-12-04T09:58:55.9140471Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1513", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.9140687Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9140853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9141138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32 2025-12-04T09:58:55.9141762Z triton_flex_attention_backward_1513 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9142386Z triton_flex_attention_backward_1507 0.0180 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9143011Z triton_flex_attention_backward_1504 0.0187 ms 83.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9143667Z triton_flex_attention_backward_1505 0.0188 ms 82.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9144291Z triton_flex_attention_backward_1515 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9144917Z triton_flex_attention_backward_1514 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9145543Z triton_flex_attention_backward_1512 0.0216 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9146215Z triton_flex_attention_backward_1517 0.0220 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9146833Z triton_flex_attention_backward_1499 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9147460Z triton_flex_attention_backward_1508 0.0228 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.9147604Z SingleProcess AUTOTUNE benchmarking takes 0.2781 seconds and 0.9120 seconds precompiling for 22 choices 2025-12-04T09:58:55.9147678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9147746Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9147785Z unimplemented [] 2025-12-04T09:58:55.9147846Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9147945Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9148519Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9148557Z graph_break [] 2025-12-04T09:58:55.9148631Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9148672Z Autotune Choices Stats: 2025-12-04T09:58:55.9149413Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1524", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008958999998867512, "best_triton_pos": 0} 
2025-12-04T09:58:55.9149554Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9149669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9149831Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9150447Z triton_flex_attention_1524 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9151053Z triton_flex_attention_1525 0.0099 ms 90.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9151661Z triton_flex_attention_1523 0.0116 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9152299Z triton_flex_attention_1520 0.0117 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9152899Z triton_flex_attention_1521 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9153499Z triton_flex_attention_1522 0.0128 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9154116Z triton_flex_attention_1540 0.0131 ms 68.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9154726Z triton_flex_attention_1532 0.0137 ms 65.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9155330Z triton_flex_attention_1538 0.0143 ms 62.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9155967Z triton_flex_attention_1530 0.0147 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9156124Z SingleProcess AUTOTUNE benchmarking takes 0.2200 seconds and 0.4249 seconds precompiling for 24 choices 2025-12-04T09:58:55.9156164Z Autotune Choices Stats: 2025-12-04T09:58:55.9156958Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1559", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015639999881386757, "best_triton_pos": 0} 2025-12-04T09:58:55.9157176Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9157340Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9157614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9158249Z triton_flex_attention_backward_1559 0.0156 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9158893Z triton_flex_attention_backward_1553 0.0183 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9159516Z triton_flex_attention_backward_1550 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9160138Z triton_flex_attention_backward_1551 0.0189 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9160806Z triton_flex_attention_backward_1560 0.0200 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9161432Z triton_flex_attention_backward_1561 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9162062Z triton_flex_attention_backward_1563 0.0217 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9162695Z triton_flex_attention_backward_1558 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9163322Z triton_flex_attention_backward_1554 0.0224 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9163943Z triton_flex_attention_backward_1545 0.0230 ms 67.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9164071Z SingleProcess AUTOTUNE benchmarking takes 0.2604 seconds and 0.8737 
seconds precompiling for 22 choices 2025-12-04T09:58:55.9164157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9164201Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9164237Z unimplemented [] 2025-12-04T09:58:55.9164299Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9164399Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9164988Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9165028Z graph_break [] 2025-12-04T09:58:55.9165102Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9165142Z Autotune Choices Stats: 2025-12-04T09:58:55.9165886Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1570", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.9166045Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 
1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9166160Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9166340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9166953Z triton_flex_attention_1570 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9167558Z triton_flex_attention_1568 0.0102 ms 87.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9168164Z triton_flex_attention_1569 0.0113 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9168778Z triton_flex_attention_1567 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9169412Z triton_flex_attention_1586 0.0130 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9170020Z triton_flex_attention_1578 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9170624Z triton_flex_attention_1584 0.0141 ms 63.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9171234Z triton_flex_attention_1566 0.0143 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9171838Z triton_flex_attention_1571 0.0144 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9172439Z triton_flex_attention_1576 0.0147 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9172568Z SingleProcess AUTOTUNE benchmarking takes 0.2371 seconds and 0.4264 seconds precompiling for 24 choices 2025-12-04T09:58:55.9172619Z Autotune Choices Stats: 2025-12-04T09:58:55.9173387Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1605", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015720000490546227, "best_triton_pos": 0} 2025-12-04T09:58:55.9173603Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9173769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9174052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9174685Z triton_flex_attention_backward_1605 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9175318Z triton_flex_attention_backward_1599 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9175977Z triton_flex_attention_backward_1596 0.0188 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9176611Z triton_flex_attention_backward_1597 0.0188 ms 83.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9177237Z triton_flex_attention_backward_1607 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9177925Z triton_flex_attention_backward_1606 0.0204 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9178553Z triton_flex_attention_backward_1604 0.0217 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9179182Z triton_flex_attention_backward_1609 0.0221 ms 71.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9179827Z triton_flex_attention_backward_1600 0.0229 ms 68.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9180446Z triton_flex_attention_backward_1591 0.0232 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9180579Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.7388 seconds precompiling for 22 choices 2025-12-04T09:58:55.9180656Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9180698Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9180735Z unimplemented [] 2025-12-04T09:58:55.9180796Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9180896Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9181477Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9181516Z graph_break [] 2025-12-04T09:58:55.9181599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9181640Z Autotune Choices Stats: 2025-12-04T09:58:55.9182385Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1614", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010280000045895576, "best_triton_pos": 0} 2025-12-04T09:58:55.9182516Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9182632Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9182792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9183399Z triton_flex_attention_1614 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9184012Z triton_flex_attention_1612 0.0114 ms 90.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9184611Z triton_flex_attention_1615 0.0117 ms 87.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, 
waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9185221Z triton_flex_attention_1616 0.0121 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9185848Z triton_flex_attention_1632 0.0132 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9186500Z triton_flex_attention_1613 0.0133 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9187103Z triton_flex_attention_1624 0.0136 ms 75.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9187709Z triton_flex_attention_1617 0.0139 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9188323Z triton_flex_attention_1630 0.0142 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9188922Z triton_flex_attention_1622 0.0150 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9189049Z SingleProcess AUTOTUNE benchmarking takes 0.2358 seconds and 0.4515 seconds precompiling for 24 choices 2025-12-04T09:58:55.9189090Z Autotune Choices Stats: 2025-12-04T09:58:55.9189845Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1651", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01579900085926056, "best_triton_pos": 0} 2025-12-04T09:58:55.9190075Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9190263Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9190540Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9191172Z triton_flex_attention_backward_1651 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9191799Z triton_flex_attention_backward_1645 0.0182 ms 86.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9192435Z triton_flex_attention_backward_1642 0.0186 ms 85.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9193056Z triton_flex_attention_backward_1643 0.0187 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9193682Z triton_flex_attention_backward_1653 0.0201 ms 78.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9194327Z triton_flex_attention_backward_1652 0.0202 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9194969Z triton_flex_attention_backward_1650 0.0218 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9195599Z 
triton_flex_attention_backward_1655 0.0220 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9196271Z triton_flex_attention_backward_1646 0.0227 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9196915Z triton_flex_attention_backward_1637 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9197043Z SingleProcess AUTOTUNE benchmarking takes 0.2701 seconds and 0.8619 seconds precompiling for 22 choices 2025-12-04T09:58:55.9197119Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T09:58:55.9197161Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9197198Z unimplemented [] 2025-12-04T09:58:55.9197259Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9197360Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9197932Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9197990Z graph_break [] 2025-12-04T09:58:55.9198063Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9198104Z Autotune Choices Stats: 2025-12-04T09:58:55.9198869Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1660", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009719000197947025, "best_triton_pos": 0} 2025-12-04T09:58:55.9198995Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9199110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9199271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9199893Z triton_flex_attention_1660 0.0097 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9200496Z triton_flex_attention_1662 0.0104 ms 93.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9201114Z triton_flex_attention_1661 0.0118 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.9201715Z triton_flex_attention_1678 0.0128 ms 76.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9202316Z triton_flex_attention_1659 0.0130 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9202951Z triton_flex_attention_1663 0.0130 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9203553Z triton_flex_attention_1670 0.0136 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9204156Z triton_flex_attention_1676 0.0144 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9204755Z triton_flex_attention_1668 0.0147 ms 66.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9205372Z triton_flex_attention_1674 0.0164 ms 59.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.9205500Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.5166 seconds precompiling for 24 choices 2025-12-04T09:58:55.9205540Z Autotune Choices Stats: 2025-12-04T09:58:55.9206349Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1697", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.9206569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9206755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9207030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9207707Z triton_flex_attention_backward_1697 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9208333Z triton_flex_attention_backward_1691 0.0184 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9208958Z triton_flex_attention_backward_1688 0.0187 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9209597Z triton_flex_attention_backward_1689 0.0188 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9210229Z triton_flex_attention_backward_1699 0.0198 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9210860Z triton_flex_attention_backward_1698 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9211518Z triton_flex_attention_backward_1696 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9212165Z triton_flex_attention_backward_1701 0.0219 ms 71.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9212793Z triton_flex_attention_backward_1692 0.0227 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9213415Z triton_flex_attention_backward_1683 0.0232 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9213556Z SingleProcess AUTOTUNE benchmarking takes 0.2560 seconds and 0.8401 seconds precompiling for 22 choices 2025-12-04T09:58:55.9213634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9213677Z frames [('total', 
2), ('ok', 2)] 2025-12-04T09:58:55.9213718Z unimplemented [] 2025-12-04T09:58:55.9213778Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9213878Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9214455Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9214492Z graph_break [] 2025-12-04T09:58:55.9214568Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9214610Z Autotune Choices Stats: 2025-12-04T09:58:55.9215352Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1708", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.9215487Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9215622Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 
1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9215784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9216435Z triton_flex_attention_1708 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9217039Z triton_flex_attention_1709 0.0109 ms 96.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9217636Z triton_flex_attention_1707 0.0117 ms 89.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9218255Z triton_flex_attention_1705 0.0130 ms 80.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9218865Z triton_flex_attention_1724 0.0135 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9219467Z triton_flex_attention_1706 0.0136 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9220120Z triton_flex_attention_1716 0.0142 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9220722Z triton_flex_attention_1722 0.0143 ms 73.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9221330Z triton_flex_attention_1714 0.0149 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9221934Z triton_flex_attention_1720 0.0162 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9222075Z SingleProcess AUTOTUNE 
benchmarking takes 0.2434 seconds and 0.4106 seconds precompiling for 24 choices 2025-12-04T09:58:55.9222119Z Autotune Choices Stats: 2025-12-04T09:58:55.9222882Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1743", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015359999611973763, "best_triton_pos": 0} 2025-12-04T09:58:55.9223100Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9223266Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9223543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9224210Z triton_flex_attention_backward_1743 0.0154 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9224834Z triton_flex_attention_backward_1737 0.0181 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9225463Z triton_flex_attention_backward_1734 0.0187 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9226128Z triton_flex_attention_backward_1735 0.0188 ms 81.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9226769Z triton_flex_attention_backward_1745 0.0203 ms 75.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9227398Z triton_flex_attention_backward_1744 0.0203 ms 75.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9228022Z triton_flex_attention_backward_1742 0.0218 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9228689Z triton_flex_attention_backward_1747 0.0220 ms 69.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9229312Z triton_flex_attention_backward_1738 0.0228 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9229941Z triton_flex_attention_backward_1729 0.0230 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9230070Z SingleProcess AUTOTUNE benchmarking takes 0.2527 seconds and 0.7882 seconds precompiling for 22 choices 2025-12-04T09:58:55.9230145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9230199Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9230239Z unimplemented [] 
2025-12-04T09:58:55.9230300Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9230398Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9230975Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9231013Z graph_break [] 2025-12-04T09:58:55.9231090Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9231130Z Autotune Choices Stats: 2025-12-04T09:58:55.9231873Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1754", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009398999623954296, "best_triton_pos": 0} 2025-12-04T09:58:55.9232003Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9232130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9232292Z 
dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9232915Z triton_flex_attention_1754 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9233513Z triton_flex_attention_1755 0.0104 ms 90.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9234120Z triton_flex_attention_1752 0.0112 ms 84.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9234723Z triton_flex_attention_1753 0.0117 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9235334Z triton_flex_attention_1750 0.0120 ms 78.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9235988Z triton_flex_attention_1770 0.0132 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9236593Z triton_flex_attention_1751 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9237235Z triton_flex_attention_1762 0.0140 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9237835Z triton_flex_attention_1768 0.0146 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9238441Z triton_flex_attention_1760 0.0150 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9238571Z SingleProcess AUTOTUNE benchmarking takes 0.2227 seconds and 0.4678 seconds 
precompiling for 24 choices 2025-12-04T09:58:55.9238610Z Autotune Choices Stats: 2025-12-04T09:58:55.9239385Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1789", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.9239601Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9239766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9240044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9240680Z triton_flex_attention_backward_1789 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9241341Z triton_flex_attention_backward_1783 0.0184 ms 85.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9241964Z triton_flex_attention_backward_1780 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9242590Z triton_flex_attention_backward_1781 0.0187 ms 83.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9243219Z 
triton_flex_attention_backward_1791 0.0202 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9243857Z triton_flex_attention_backward_1790 0.0204 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9244480Z triton_flex_attention_backward_1788 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9245106Z triton_flex_attention_backward_1793 0.0219 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9245761Z triton_flex_attention_backward_1784 0.0226 ms 69.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9246414Z triton_flex_attention_backward_1775 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9246543Z SingleProcess AUTOTUNE benchmarking takes 0.2632 seconds and 0.8758 seconds precompiling for 22 choices 2025-12-04T09:58:55.9246620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9246662Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9246698Z unimplemented [] 2025-12-04T09:58:55.9246760Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T09:58:55.9246858Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9247429Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 69), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 23), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9247482Z graph_break [] 2025-12-04T09:58:55.9247560Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9247601Z Autotune Choices Stats: 2025-12-04T09:58:55.9248345Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1801", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010479999706149101, "best_triton_pos": 0} 2025-12-04T09:58:55.9248475Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9248591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9248751Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9249359Z triton_flex_attention_1801 0.0105 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9249998Z triton_flex_attention_1800 0.0108 ms 97.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9250602Z triton_flex_attention_1816 0.0128 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9251205Z triton_flex_attention_1798 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9251818Z triton_flex_attention_1797 0.0130 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9252419Z triton_flex_attention_1808 0.0133 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9253021Z triton_flex_attention_1814 0.0140 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9253620Z triton_flex_attention_1806 0.0150 ms 69.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9254254Z triton_flex_attention_1799 0.0158 ms 66.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9254857Z triton_flex_attention_1812 0.0164 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9254991Z SingleProcess AUTOTUNE benchmarking takes 0.2483 seconds and 0.4169 seconds precompiling for 24 choices 2025-12-04T09:58:55.9255033Z Autotune Choices Stats: 
2025-12-04T09:58:55.9255792Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1835", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01576000079512596, "best_triton_pos": 0} 2025-12-04T09:58:55.9256061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9256232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9256511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9257138Z triton_flex_attention_backward_1835 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9257753Z triton_flex_attention_backward_1829 0.0184 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9258413Z triton_flex_attention_backward_1826 0.0186 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9259040Z triton_flex_attention_backward_1827 0.0186 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9259671Z triton_flex_attention_backward_1837 0.0202 ms 78.2% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9260295Z triton_flex_attention_backward_1836 0.0202 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9260937Z triton_flex_attention_backward_1834 0.0219 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9261568Z triton_flex_attention_backward_1839 0.0221 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9262192Z triton_flex_attention_backward_1830 0.0228 ms 69.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9262843Z triton_flex_attention_backward_1821 0.0230 ms 68.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9262974Z SingleProcess AUTOTUNE benchmarking takes 0.2624 seconds and 0.8439 seconds precompiling for 22 choices 2025-12-04T09:58:55.9263047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9263091Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9263128Z unimplemented [] 2025-12-04T09:58:55.9263189Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9263288Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9263858Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9263897Z graph_break [] 2025-12-04T09:58:55.9263972Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9264023Z Autotune Choices Stats: 2025-12-04T09:58:55.9264756Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1846", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009399999864399433, "best_triton_pos": 0} 2025-12-04T09:58:55.9264882Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9264997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9265161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T09:58:55.9265776Z triton_flex_attention_1846 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9266420Z triton_flex_attention_1844 0.0102 ms 91.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9267052Z triton_flex_attention_1845 0.0120 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9267658Z triton_flex_attention_1843 0.0130 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9268258Z triton_flex_attention_1854 0.0132 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9268875Z triton_flex_attention_1862 0.0134 ms 70.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9269474Z triton_flex_attention_1842 0.0137 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.9270082Z triton_flex_attention_1847 0.0138 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9270684Z triton_flex_attention_1860 0.0144 ms 65.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9271324Z triton_flex_attention_1852 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9271452Z SingleProcess AUTOTUNE benchmarking takes 0.2274 seconds and 0.3833 seconds precompiling for 24 choices 2025-12-04T09:58:55.9271494Z Autotune Choices Stats: 2025-12-04T09:58:55.9272262Z {"num_choices": 22, 
"num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_1881", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01583999954164028, "best_triton_pos": 0} 2025-12-04T09:58:55.9272478Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9272642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9272931Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9273558Z triton_flex_attention_backward_1881 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9274187Z triton_flex_attention_backward_1875 0.0184 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9274813Z triton_flex_attention_backward_1873 0.0187 ms 84.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9275467Z triton_flex_attention_backward_1872 0.0188 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9276125Z triton_flex_attention_backward_1883 0.0201 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, 
BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9276748Z triton_flex_attention_backward_1882 0.0202 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9277384Z triton_flex_attention_backward_1880 0.0220 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9278012Z triton_flex_attention_backward_1885 0.0220 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9278643Z triton_flex_attention_backward_1876 0.0224 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9279258Z triton_flex_attention_backward_1867 0.0232 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9279415Z SingleProcess AUTOTUNE benchmarking takes 0.2681 seconds and 0.7872 seconds precompiling for 22 choices 2025-12-04T09:58:55.9279505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9279548Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9279585Z unimplemented [] 2025-12-04T09:58:55.9279647Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9279748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9280320Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9280359Z graph_break [] 2025-12-04T09:58:55.9280434Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9280474Z Autotune Choices Stats: 2025-12-04T09:58:55.9281217Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1893", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010040000081062317, "best_triton_pos": 0} 2025-12-04T09:58:55.9281356Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9281471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9281632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9282238Z 
triton_flex_attention_1893 0.0100 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9282845Z triton_flex_attention_1892 0.0106 ms 95.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9283458Z triton_flex_attention_1891 0.0117 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9284077Z triton_flex_attention_1890 0.0127 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9284682Z triton_flex_attention_1908 0.0130 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9285286Z triton_flex_attention_1889 0.0132 ms 75.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9285899Z triton_flex_attention_1900 0.0135 ms 74.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.9286528Z triton_flex_attention_1906 0.0140 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9287131Z triton_flex_attention_1898 0.0148 ms 67.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9287749Z triton_flex_attention_1904 0.0162 ms 61.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9287901Z SingleProcess AUTOTUNE benchmarking takes 0.2478 seconds and 0.5052 seconds precompiling for 24 choices 2025-12-04T09:58:55.9287955Z Autotune Choices Stats: 2025-12-04T09:58:55.9288705Z {"num_choices": 22, "num_triton_choices": 22, 
"best_kernel": "triton_flex_attention_backward_1927", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015519999898970127, "best_triton_pos": 0} 2025-12-04T09:58:55.9288928Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9289097Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9289372Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9290020Z triton_flex_attention_backward_1927 0.0155 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=1, num_warps=4 2025-12-04T09:58:55.9290640Z triton_flex_attention_backward_1921 0.0183 ms 84.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9291262Z triton_flex_attention_backward_1918 0.0185 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9291884Z triton_flex_attention_backward_1919 0.0186 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9292544Z triton_flex_attention_backward_1929 0.0201 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9293170Z triton_flex_attention_backward_1928 0.0202 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9293790Z triton_flex_attention_backward_1926 0.0217 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9294433Z triton_flex_attention_backward_1931 0.0220 ms 70.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9295059Z triton_flex_attention_backward_1922 0.0227 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9295681Z triton_flex_attention_backward_1913 0.0230 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9295819Z SingleProcess AUTOTUNE benchmarking takes 0.2709 seconds and 0.9165 seconds precompiling for 22 choices 2025-12-04T09:58:55.9295894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9295965Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9296004Z unimplemented [] 2025-12-04T09:58:55.9296066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9296168Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T09:58:55.9296766Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9296805Z graph_break [] 2025-12-04T09:58:55.9296878Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9296922Z Autotune Choices Stats: 2025-12-04T09:58:55.9297664Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1938", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009960000403225422, "best_triton_pos": 0} 2025-12-04T09:58:55.9297790Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9297904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9298076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9298697Z triton_flex_attention_1938 0.0100 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9299299Z triton_flex_attention_1936 0.0100 ms 99.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9299901Z triton_flex_attention_1939 0.0101 ms 98.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9300536Z triton_flex_attention_1935 0.0129 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9301144Z triton_flex_attention_1937 0.0134 ms 74.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9301747Z triton_flex_attention_1946 0.0137 ms 72.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9302351Z triton_flex_attention_1954 0.0139 ms 71.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9302972Z 
triton_flex_attention_1952 0.0146 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9303571Z triton_flex_attention_1944 0.0151 ms 66.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9304173Z triton_flex_attention_1950 0.0165 ms 60.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9304316Z SingleProcess AUTOTUNE benchmarking takes 0.2498 seconds and 0.4270 seconds precompiling for 24 choices 2025-12-04T09:58:55.9304362Z Autotune Choices Stats: 2025-12-04T09:58:55.9305143Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": 
"triton_flex_attention_backward_1973", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015678999945521355, "best_triton_pos": 0} 2025-12-04T09:58:55.9305362Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9305528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9305806Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9306472Z triton_flex_attention_backward_1973 0.0157 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, 
num_warps=4 2025-12-04T09:58:55.9307116Z triton_flex_attention_backward_1967 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9307739Z triton_flex_attention_backward_1964 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9308364Z triton_flex_attention_backward_1965 0.0187 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9309008Z triton_flex_attention_backward_1975 0.0199 ms 78.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9309665Z triton_flex_attention_backward_1974 0.0201 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9310290Z triton_flex_attention_backward_1972 0.0216 ms 72.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9310916Z triton_flex_attention_backward_1977 0.0220 ms 71.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9311554Z triton_flex_attention_backward_1968 0.0226 ms 69.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9312180Z triton_flex_attention_backward_1959 0.0228 ms 68.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9312310Z SingleProcess AUTOTUNE benchmarking takes 0.2677 seconds and 0.8736 seconds precompiling for 22 choices 2025-12-04T09:58:55.9312386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9312429Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9312467Z unimplemented [] 2025-12-04T09:58:55.9312527Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9312639Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9313208Z inductor 
[('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9313256Z graph_break [] 2025-12-04T09:58:55.9313342Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9313381Z Autotune Choices Stats: 2025-12-04T09:58:55.9314119Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_1984", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009600000455975533, "best_triton_pos": 0} 2025-12-04T09:58:55.9314246Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9314362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9314522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9315132Z triton_flex_attention_1984 0.0096 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, 
BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9315743Z triton_flex_attention_1982 0.0101 ms 94.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9316393Z triton_flex_attention_1983 0.0116 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9316996Z triton_flex_attention_2000 0.0130 ms 73.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9317634Z triton_flex_attention_1985 0.0132 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9318233Z triton_flex_attention_1981 0.0133 ms 72.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9318841Z triton_flex_attention_1992 0.0137 ms 70.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9319441Z triton_flex_attention_1998 0.0140 ms 68.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9320066Z triton_flex_attention_1990 0.0150 ms 64.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9320670Z triton_flex_attention_1996 0.0162 ms 59.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9320801Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.3620 seconds precompiling for 24 choices 2025-12-04T09:58:55.9320843Z Autotune Choices Stats: 2025-12-04T09:58:55.9321597Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2019", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.9321835Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9322010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9322289Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9322929Z triton_flex_attention_backward_2019 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9323553Z 
triton_flex_attention_backward_2013 0.0182 ms 85.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9324187Z triton_flex_attention_backward_2010 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9324808Z triton_flex_attention_backward_2011 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9325432Z triton_flex_attention_backward_2021 0.0202 ms 77.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9326140Z triton_flex_attention_backward_2020 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9326762Z triton_flex_attention_backward_2018 0.0219 ms 71.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9327393Z triton_flex_attention_backward_2023 0.0222 ms 70.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9328019Z triton_flex_attention_backward_2014 0.0228 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9328657Z triton_flex_attention_backward_2005 0.0232 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9328787Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.8715 seconds precompiling for 22 choices 2025-12-04T09:58:55.9328861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9328904Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9328942Z unimplemented [] 2025-12-04T09:58:55.9329003Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9329103Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9329679Z inductor [('triton_bundler_save_kernel', 528), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9329731Z graph_break [] 2025-12-04T09:58:55.9329809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9329848Z Autotune Choices Stats: 2025-12-04T09:58:55.9330604Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2030", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009279999881982803, "best_triton_pos": 0} 2025-12-04T09:58:55.9330733Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9330849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9331010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9331624Z triton_flex_attention_2030 0.0093 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9332233Z triton_flex_attention_2031 0.0108 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9332843Z triton_flex_attention_2026 0.0112 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9333443Z triton_flex_attention_2028 0.0113 ms 82.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9334047Z triton_flex_attention_2029 0.0116 ms 79.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9334689Z triton_flex_attention_2046 0.0132 ms 70.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9335290Z triton_flex_attention_2027 0.0132 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9335896Z triton_flex_attention_2038 0.0134 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9336529Z triton_flex_attention_2044 0.0144 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9337147Z triton_flex_attention_2024 0.0147 ms 63.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9337276Z SingleProcess AUTOTUNE benchmarking takes 0.1936 seconds and 0.4021 seconds precompiling for 24 choices 2025-12-04T09:58:55.9337316Z Autotune Choices Stats: 2025-12-04T09:58:55.9338071Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2065", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.9338300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9338469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9338772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9339401Z triton_flex_attention_backward_2065 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9340024Z triton_flex_attention_backward_2059 0.0182 ms 85.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9340641Z triton_flex_attention_backward_2056 0.0186 ms 83.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9341275Z triton_flex_attention_backward_2057 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9341899Z triton_flex_attention_backward_2066 0.0200 ms 78.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9342529Z triton_flex_attention_backward_2067 0.0200 ms 77.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9343184Z triton_flex_attention_backward_2064 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9343809Z triton_flex_attention_backward_2069 0.0218 ms 71.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9344436Z triton_flex_attention_backward_2060 0.0224 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9345062Z triton_flex_attention_backward_2051 0.0230 ms 67.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9345204Z SingleProcess AUTOTUNE benchmarking takes 0.2678 seconds and 0.8209 seconds precompiling for 22 choices 2025-12-04T09:58:55.9345278Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9345322Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9345358Z unimplemented [] 2025-12-04T09:58:55.9345422Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9345521Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9346124Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), 
('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9346161Z graph_break [] 2025-12-04T09:58:55.9346240Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9346279Z Autotune Choices Stats: 2025-12-04T09:58:55.9347018Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2077", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8", "best_time": 0.010320000350475311, "best_triton_pos": 0} 2025-12-04T09:58:55.9347182Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9347309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9347472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9348088Z triton_flex_attention_2077 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9348693Z triton_flex_attention_2074 0.0118 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9349297Z triton_flex_attention_2076 0.0128 ms 80.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9349919Z triton_flex_attention_2073 0.0130 ms 79.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, 
matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9350523Z triton_flex_attention_2084 0.0136 ms 75.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9351127Z triton_flex_attention_2092 0.0139 ms 74.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9351768Z triton_flex_attention_2090 0.0144 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9352367Z triton_flex_attention_2082 0.0150 ms 69.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9352967Z triton_flex_attention_2075 0.0154 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9353568Z triton_flex_attention_2088 0.0165 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9353709Z SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 0.3908 seconds precompiling for 24 choices 2025-12-04T09:58:55.9353749Z Autotune Choices Stats: 2025-12-04T09:58:55.9354508Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2111", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01563899964094162, "best_triton_pos": 0} 2025-12-04T09:58:55.9354728Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9354892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9355180Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9355832Z triton_flex_attention_backward_2111 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9356489Z triton_flex_attention_backward_2105 0.0181 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9357115Z triton_flex_attention_backward_2110 0.0181 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9357738Z triton_flex_attention_backward_2102 0.0186 ms 84.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9358387Z triton_flex_attention_backward_2103 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9359021Z triton_flex_attention_backward_2113 0.0203 ms 77.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9359645Z triton_flex_attention_backward_2112 0.0204 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9360311Z triton_flex_attention_backward_2115 0.0221 ms 70.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.9360937Z triton_flex_attention_backward_2097 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9362823Z triton_flex_attention_backward_2106 0.0230 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9362959Z SingleProcess AUTOTUNE benchmarking takes 0.4709 seconds and 0.7187 seconds precompiling for 22 choices 2025-12-04T09:58:55.9363049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9363097Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9363134Z unimplemented [] 2025-12-04T09:58:55.9363198Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9363298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9363887Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9363926Z graph_break [] 2025-12-04T09:58:55.9364004Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9364045Z Autotune Choices Stats: 2025-12-04T09:58:55.9364790Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2122", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008960000239312649, "best_triton_pos": 0} 2025-12-04T09:58:55.9364935Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9365050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9365215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9365839Z triton_flex_attention_2122 0.0090 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9366486Z triton_flex_attention_2123 0.0100 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9367138Z triton_flex_attention_2119 0.0129 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9367757Z triton_flex_attention_2121 0.0133 ms 67.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, 
num_stages=1, num_warps=8 2025-12-04T09:58:55.9368366Z triton_flex_attention_2138 0.0134 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9368984Z triton_flex_attention_2130 0.0139 ms 64.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9369592Z triton_flex_attention_2120 0.0142 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9370228Z triton_flex_attention_2136 0.0145 ms 61.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9370837Z triton_flex_attention_2128 0.0149 ms 60.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9371475Z triton_flex_attention_2134 0.0166 ms 53.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9371608Z SingleProcess AUTOTUNE benchmarking takes 0.2470 seconds and 0.4797 seconds precompiling for 24 choices 2025-12-04T09:58:55.9371659Z Autotune Choices Stats: 2025-12-04T09:58:55.9372423Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2157", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015599999576807022, "best_triton_pos": 0} 2025-12-04T09:58:55.9372644Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9372810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9373095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9373735Z triton_flex_attention_backward_2157 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9374390Z triton_flex_attention_backward_2151 0.0182 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9375022Z triton_flex_attention_backward_2149 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9375673Z triton_flex_attention_backward_2148 0.0188 ms 83.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9376338Z triton_flex_attention_backward_2159 0.0202 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9376992Z triton_flex_attention_backward_2158 0.0203 ms 76.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9377617Z triton_flex_attention_backward_2156 0.0216 ms 72.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9378255Z triton_flex_attention_backward_2161 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9378900Z 
triton_flex_attention_backward_2152 0.0228 ms 68.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9379528Z triton_flex_attention_backward_2143 0.0232 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9379659Z SingleProcess AUTOTUNE benchmarking takes 0.2555 seconds and 0.9394 seconds precompiling for 22 choices 2025-12-04T09:58:55.9379735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9379781Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9379817Z unimplemented [] 2025-12-04T09:58:55.9379879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9379994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9380665Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 65), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9380704Z graph_break [] 2025-12-04T09:58:55.9380777Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9380819Z Autotune Choices Stats: 2025-12-04T09:58:55.9381553Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2168", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009200000204145908, "best_triton_pos": 0} 2025-12-04T09:58:55.9381686Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9381802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9381964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9382596Z triton_flex_attention_2168 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9383214Z triton_flex_attention_2166 0.0101 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9383825Z triton_flex_attention_2169 0.0104 ms 88.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9384446Z triton_flex_attention_2167 0.0113 ms 81.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.9385069Z triton_flex_attention_2184 0.0132 ms 69.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9385674Z triton_flex_attention_2165 0.0133 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9386313Z triton_flex_attention_2176 0.0135 ms 68.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9386917Z triton_flex_attention_2182 0.0140 ms 65.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9387553Z triton_flex_attention_2174 0.0150 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9388156Z triton_flex_attention_2180 0.0164 ms 56.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9388288Z SingleProcess AUTOTUNE benchmarking takes 0.2350 seconds and 0.4301 seconds precompiling for 24 choices 2025-12-04T09:58:55.9388329Z Autotune Choices Stats: 2025-12-04T09:58:55.9389103Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2203", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015560000203549862, "best_triton_pos": 0} 2025-12-04T09:58:55.9389336Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9389501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9389783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9390419Z triton_flex_attention_backward_2203 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9391055Z triton_flex_attention_backward_2197 0.0181 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9391711Z triton_flex_attention_backward_2195 0.0186 ms 83.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9392331Z triton_flex_attention_backward_2194 0.0187 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9392974Z triton_flex_attention_backward_2205 0.0202 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9393616Z triton_flex_attention_backward_2204 0.0203 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9394243Z triton_flex_attention_backward_2202 0.0217 ms 71.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9394876Z triton_flex_attention_backward_2207 0.0219 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9395509Z 
triton_flex_attention_backward_2198 0.0227 ms 68.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9396210Z triton_flex_attention_backward_2189 0.0230 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9396341Z SingleProcess AUTOTUNE benchmarking takes 0.2634 seconds and 0.7312 seconds precompiling for 22 choices 2025-12-04T09:58:55.9396419Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9396462Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9396499Z unimplemented [] 2025-12-04T09:58:55.9396560Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9396660Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9397248Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 22), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T09:58:55.9397302Z graph_break [] 2025-12-04T09:58:55.9397377Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9397418Z Autotune Choices Stats: 2025-12-04T09:58:55.9398168Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2212", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.009920000098645687, "best_triton_pos": 0} 2025-12-04T09:58:55.9398299Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9398417Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9398580Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9399194Z triton_flex_attention_2212 0.0099 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9399829Z triton_flex_attention_2214 0.0108 ms 92.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9400440Z triton_flex_attention_2213 0.0111 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9401050Z triton_flex_attention_2230 0.0128 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 
2025-12-04T09:58:55.9401665Z triton_flex_attention_2211 0.0128 ms 77.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9402290Z triton_flex_attention_2222 0.0133 ms 74.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9402900Z triton_flex_attention_2215 0.0134 ms 74.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9403511Z triton_flex_attention_2228 0.0143 ms 69.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9404123Z triton_flex_attention_2220 0.0147 ms 67.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9404740Z triton_flex_attention_2226 0.0164 ms 60.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9404875Z SingleProcess AUTOTUNE benchmarking takes 0.2288 seconds and 0.3817 seconds precompiling for 24 choices 2025-12-04T09:58:55.9404917Z Autotune Choices Stats: 2025-12-04T09:58:55.9405686Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2249", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.015759000554680824, "best_triton_pos": 0} 2025-12-04T09:58:55.9405916Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9406129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9406412Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9407057Z triton_flex_attention_backward_2249 0.0158 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9407684Z triton_flex_attention_backward_2243 0.0184 ms 85.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9408330Z triton_flex_attention_backward_2241 0.0186 ms 84.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9408974Z triton_flex_attention_backward_2240 0.0187 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9409607Z triton_flex_attention_backward_2251 0.0199 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9410250Z triton_flex_attention_backward_2250 0.0201 ms 78.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9410891Z triton_flex_attention_backward_2253 0.0218 ms 72.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9411522Z triton_flex_attention_backward_2248 0.0219 ms 72.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9412150Z 
triton_flex_attention_backward_2244 0.0224 ms 70.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9412793Z triton_flex_attention_backward_2235 0.0229 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9412925Z SingleProcess AUTOTUNE benchmarking takes 0.2552 seconds and 0.7055 seconds precompiling for 22 choices 2025-12-04T09:58:55.9413002Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T09:58:55.9413044Z frames [('total', 2), ('ok', 2)] 2025-12-04T09:58:55.9413085Z unimplemented [] 2025-12-04T09:58:55.9413148Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T09:58:55.9413250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T09:58:55.9413825Z inductor [('triton_bundler_save_kernel', 528), ('benchmarking.InductorBenchmarker.benchmark_gpu', 64), ('async_compile_cache_miss', 62), ('select_algorithm_num_precompiles', 46), ('benchmarking.InductorBenchmarker.benchmark', 18), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('coordesc_tuning_bench', 2)] 2025-12-04T09:58:55.9413864Z graph_break [] 2025-12-04T09:58:55.9413939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T09:58:55.9413981Z Autotune Choices Stats: 2025-12-04T09:58:55.9414736Z {"num_choices": 24, "num_triton_choices": 24, "best_kernel": "triton_flex_attention_2260", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.008919999934732914, "best_triton_pos": 0} 2025-12-04T09:58:55.9414875Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9414990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9415151Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9415777Z triton_flex_attention_2260 0.0089 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9416404Z triton_flex_attention_2258 0.0104 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9417043Z triton_flex_attention_2261 0.0113 ms 78.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9417642Z triton_flex_attention_2259 0.0115 ms 77.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 
2025-12-04T09:58:55.9418252Z triton_flex_attention_2257 0.0131 ms 68.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=16, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9418874Z triton_flex_attention_2276 0.0133 ms 67.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9419492Z triton_flex_attention_2268 0.0136 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9420097Z triton_flex_attention_2274 0.0138 ms 64.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9420705Z triton_flex_attention_2266 0.0148 ms 60.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9421334Z triton_flex_attention_2272 0.0164 ms 54.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9421463Z SingleProcess AUTOTUNE benchmarking takes 0.2474 seconds and 0.4395 seconds precompiling for 24 choices 2025-12-04T09:58:55.9421506Z Autotune Choices Stats: 2025-12-04T09:58:55.9422268Z {"num_choices": 22, "num_triton_choices": 22, "best_kernel": "triton_flex_attention_backward_2295", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION=\"'ieee'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4", "best_time": 0.01559900026768446, "best_triton_pos": 0} 2025-12-04T09:58:55.9422489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T09:58:55.9422657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T09:58:55.9422951Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T09:58:55.9423595Z triton_flex_attention_backward_2295 0.0156 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9424222Z triton_flex_attention_backward_2289 0.0184 ms 85.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9424855Z triton_flex_attention_backward_2287 0.0186 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9425502Z triton_flex_attention_backward_2286 0.0188 ms 83.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9426172Z triton_flex_attention_backward_2297 0.0202 ms 77.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9426807Z triton_flex_attention_backward_2296 0.0203 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9427451Z triton_flex_attention_backward_2294 0.0218 ms 71.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9428099Z triton_flex_attention_backward_2299 0.0220 ms 71.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=64, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=8 2025-12-04T09:58:55.9428730Z 
triton_flex_attention_backward_2290 0.0228 ms 68.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=0, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9429367Z triton_flex_attention_backward_2281 0.0229 ms 68.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=16, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=16, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=1, num_warps=4 2025-12-04T09:58:55.9429509Z SingleProcess AUTOTUNE benchmarking takes 0.2617 seconds and 0.8243 seconds precompiling for 22 choices 2025-12-04T09:58:55.9429742Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8ea7c7770886d406.xml - 2025-12-04T09:58:55.9429807Z =========================== short test summary info ============================ 2025-12-04T09:58:55.9430121Z FAILED [4.7262s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp18_m9leu/flex_attention_configs.json was not created 2025-12-04T09:58:55.9430126Z 2025-12-04T09:58:55.9430203Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9430370Z 
PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9430373Z 2025-12-04T09:58:55.9430464Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9430736Z FAILED [4.6440s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpfn7vxgza/flex_attention_configs.json was not created 2025-12-04T09:58:55.9430738Z 2025-12-04T09:58:55.9430812Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9430971Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9430974Z 2025-12-04T09:58:55.9431063Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9431341Z FAILED [4.8129s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp0m7dzmdb/flex_attention_configs.json was not created 2025-12-04T09:58:55.9431345Z 2025-12-04T09:58:55.9431420Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9431597Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9431602Z 2025-12-04T09:58:55.9431688Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9431955Z FAILED [4.7262s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpkot2t0ts/flex_attention_configs.json was not created 2025-12-04T09:58:55.9431957Z 2025-12-04T09:58:55.9432028Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9432184Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9432187Z 2025-12-04T09:58:55.9432271Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9432537Z FAILED [4.4215s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmppih0duwj/flex_attention_configs.json was not created 2025-12-04T09:58:55.9432539Z 2025-12-04T09:58:55.9432609Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9432766Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9432768Z 2025-12-04T09:58:55.9432850Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9433115Z FAILED [4.0950s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpho6dwhch/flex_attention_configs.json was not created 2025-12-04T09:58:55.9433129Z 2025-12-04T09:58:55.9433203Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9433357Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9433359Z 2025-12-04T09:58:55.9433443Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9433719Z FAILED [4.3377s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpbx2xz6g6/flex_attention_configs.json was not created 2025-12-04T09:58:55.9433721Z 2025-12-04T09:58:55.9433793Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9433950Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9433953Z 2025-12-04T09:58:55.9434039Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9434305Z FAILED [4.4796s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpq51ab7fj/flex_attention_configs.json was not created 2025-12-04T09:58:55.9434309Z 2025-12-04T09:58:55.9434379Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9434536Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9434539Z 2025-12-04T09:58:55.9434622Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9434889Z FAILED [4.2016s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpuowzk5ja/flex_attention_configs.json was not created 2025-12-04T09:58:55.9434891Z 2025-12-04T09:58:55.9434982Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9435140Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9435152Z 2025-12-04T09:58:55.9435234Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9435499Z FAILED [4.4079s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpvkf0nme2/flex_attention_configs.json was not created 2025-12-04T09:58:55.9435502Z 2025-12-04T09:58:55.9435572Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9435727Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9435729Z 2025-12-04T09:58:55.9435813Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9436125Z FAILED [4.3684s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpm2zzky_x/flex_attention_configs.json was not created 2025-12-04T09:58:55.9436128Z 2025-12-04T09:58:55.9436201Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9436358Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9436359Z 2025-12-04T09:58:55.9436444Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9436709Z FAILED [4.3639s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpr1x5bd4b/flex_attention_configs.json was not created 2025-12-04T09:58:55.9436710Z 2025-12-04T09:58:55.9436782Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9436952Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9436954Z 2025-12-04T09:58:55.9437039Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9437305Z FAILED [4.3081s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmph000f7qb/flex_attention_configs.json was not created 2025-12-04T09:58:55.9437307Z 2025-12-04T09:58:55.9437391Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9437546Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9437548Z 2025-12-04T09:58:55.9437632Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9437897Z FAILED [4.6781s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpthqoqawy/flex_attention_configs.json was not created 2025-12-04T09:58:55.9437900Z 2025-12-04T09:58:55.9437970Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9438125Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9438127Z 2025-12-04T09:58:55.9438209Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9438484Z FAILED [4.6441s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp6db4qp8j/flex_attention_configs.json was not created 2025-12-04T09:58:55.9438485Z 2025-12-04T09:58:55.9438556Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9438727Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9438729Z 2025-12-04T09:58:55.9438826Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9439099Z FAILED [4.6391s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpox4mtzl8/flex_attention_configs.json was not created 2025-12-04T09:58:55.9439100Z 2025-12-04T09:58:55.9439172Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9439331Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9439333Z 2025-12-04T09:58:55.9439417Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9439681Z FAILED [4.8884s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpa823a6nj/flex_attention_configs.json was not created 2025-12-04T09:58:55.9439685Z 2025-12-04T09:58:55.9439755Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9439912Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9439914Z 2025-12-04T09:58:55.9439997Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9440262Z FAILED [4.5969s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp79ygt4gy/flex_attention_configs.json was not created 2025-12-04T09:58:55.9440264Z 2025-12-04T09:58:55.9440333Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9440488Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9440501Z 2025-12-04T09:58:55.9440584Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9440854Z FAILED [4.6290s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpmiti4lfu/flex_attention_configs.json was not created 2025-12-04T09:58:55.9440857Z 2025-12-04T09:58:55.9440927Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9441094Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9441096Z 2025-12-04T09:58:55.9441182Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9441446Z FAILED [4.4493s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp8zeqdgl2/flex_attention_configs.json was not created 2025-12-04T09:58:55.9441449Z 2025-12-04T09:58:55.9441521Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9441676Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9441678Z 2025-12-04T09:58:55.9441761Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9442027Z FAILED [4.2069s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpjnhi31tc/flex_attention_configs.json was not created 2025-12-04T09:58:55.9442029Z 2025-12-04T09:58:55.9442102Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9442256Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9442258Z 2025-12-04T09:58:55.9445565Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9445838Z FAILED [4.7474s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp3pg9sr7g/flex_attention_configs.json was not created 2025-12-04T09:58:55.9445857Z 2025-12-04T09:58:55.9445981Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9446142Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9446144Z 2025-12-04T09:58:55.9446231Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9446501Z FAILED [4.2224s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp0frn1eqy/flex_attention_configs.json was not created 2025-12-04T09:58:55.9446505Z 2025-12-04T09:58:55.9446576Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9446731Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9446734Z 2025-12-04T09:58:55.9446817Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9447083Z FAILED [5.0817s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpxarurl0o/flex_attention_configs.json was not created 2025-12-04T09:58:55.9447085Z 2025-12-04T09:58:55.9447154Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9447309Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9447311Z 2025-12-04T09:58:55.9447415Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9447678Z FAILED [4.4318s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpxgf0yi50/flex_attention_configs.json was not created 2025-12-04T09:58:55.9447680Z 2025-12-04T09:58:55.9447753Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9447906Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9447907Z 2025-12-04T09:58:55.9448006Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9448270Z FAILED [4.6751s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmprdkgqj0a/flex_attention_configs.json was not created 2025-12-04T09:58:55.9448271Z 2025-12-04T09:58:55.9448344Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9448499Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9448503Z 2025-12-04T09:58:55.9448587Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9448852Z FAILED [4.3797s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp5gigeitq/flex_attention_configs.json was not created 2025-12-04T09:58:55.9448854Z 2025-12-04T09:58:55.9448925Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9449080Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9449082Z 2025-12-04T09:58:55.9449164Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9449457Z FAILED [4.8819s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpoqq18k0p/flex_attention_configs.json was not created 2025-12-04T09:58:55.9449476Z 2025-12-04T09:58:55.9449546Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9449700Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9449702Z 2025-12-04T09:58:55.9449785Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9450049Z FAILED [4.5386s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpq1f39omc/flex_attention_configs.json was not created 2025-12-04T09:58:55.9450051Z 2025-12-04T09:58:55.9450123Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9450280Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9450282Z 2025-12-04T09:58:55.9450367Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9450629Z FAILED [4.4898s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp1l51gxyl/flex_attention_configs.json was not created 2025-12-04T09:58:55.9450631Z 2025-12-04T09:58:55.9450703Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9450858Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9450860Z 2025-12-04T09:58:55.9450945Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9451209Z FAILED [6.0320s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp3xyvfcnn/flex_attention_configs.json was not created 2025-12-04T09:58:55.9451227Z 2025-12-04T09:58:55.9451298Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9451454Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9451456Z 2025-12-04T09:58:55.9451540Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9451814Z FAILED [4.9406s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmprqbilwpb/flex_attention_configs.json was not created 2025-12-04T09:58:55.9451816Z 2025-12-04T09:58:55.9451886Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9452044Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9452047Z 2025-12-04T09:58:55.9452130Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9452393Z FAILED [4.4893s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpfcr2ai80/flex_attention_configs.json was not created 2025-12-04T09:58:55.9452395Z 2025-12-04T09:58:55.9452466Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9452623Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9452625Z 2025-12-04T09:58:55.9452708Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9452989Z FAILED [4.7407s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpjf4cuoyj/flex_attention_configs.json was not created 2025-12-04T09:58:55.9452993Z 2025-12-04T09:58:55.9453074Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9453229Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9453232Z 2025-12-04T09:58:55.9453315Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9453579Z FAILED [4.4628s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpis9kuz2a/flex_attention_configs.json was not created 2025-12-04T09:58:55.9453580Z 2025-12-04T09:58:55.9453650Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9453805Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9453808Z 2025-12-04T09:58:55.9453893Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9454155Z FAILED [4.8807s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpg_se1byr/flex_attention_configs.json was not created 2025-12-04T09:58:55.9454157Z 2025-12-04T09:58:55.9454227Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9454384Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9454386Z 2025-12-04T09:58:55.9454469Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9454735Z FAILED [4.7286s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpit3gbah7/flex_attention_configs.json was not created 2025-12-04T09:58:55.9454752Z 2025-12-04T09:58:55.9454824Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9454980Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9454982Z 2025-12-04T09:58:55.9455065Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9455344Z FAILED [4.6135s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpuv32uu08/flex_attention_configs.json was not created 2025-12-04T09:58:55.9455346Z 2025-12-04T09:58:55.9455417Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9455571Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9455574Z 2025-12-04T09:58:55.9455661Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9455953Z FAILED [5.0149s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp8s4y4nc_/flex_attention_configs.json was not created 2025-12-04T09:58:55.9455957Z 2025-12-04T09:58:55.9456028Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9456183Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9456186Z 2025-12-04T09:58:55.9456272Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9456535Z FAILED [4.4073s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp1x81keg9/flex_attention_configs.json was not created 2025-12-04T09:58:55.9456540Z 2025-12-04T09:58:55.9456626Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9456782Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9456797Z 2025-12-04T09:58:55.9456880Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9457141Z FAILED [4.8554s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpw_3v__bo/flex_attention_configs.json was not created 2025-12-04T09:58:55.9457144Z 2025-12-04T09:58:55.9457213Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9457370Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9457372Z 2025-12-04T09:58:55.9457455Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9457724Z FAILED [4.8514s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpvi278rjz/flex_attention_configs.json was not created 2025-12-04T09:58:55.9457727Z 2025-12-04T09:58:55.9457797Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9457952Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9457955Z 2025-12-04T09:58:55.9458039Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9458303Z FAILED [4.6409s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpyi1436_p/flex_attention_configs.json was not created 2025-12-04T09:58:55.9458305Z 2025-12-04T09:58:55.9458377Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9458546Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9458548Z 2025-12-04T09:58:55.9458632Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9458894Z FAILED [4.7760s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpguk9qo1r/flex_attention_configs.json was not created 2025-12-04T09:58:55.9458896Z 2025-12-04T09:58:55.9458979Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9459133Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9459135Z 2025-12-04T09:58:55.9459219Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9459483Z FAILED [4.7026s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpr79ss940/flex_attention_configs.json was not created 2025-12-04T09:58:55.9459487Z 2025-12-04T09:58:55.9459558Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9459719Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9459721Z 2025-12-04T09:58:55.9459803Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9460068Z FAILED [4.8861s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpy5ckrs5_/flex_attention_configs.json was not created 2025-12-04T09:58:55.9460070Z 2025-12-04T09:58:55.9460139Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9460306Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9460318Z 2025-12-04T09:58:55.9460403Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9460666Z FAILED [4.6108s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp2hax7tss/flex_attention_configs.json was not created 2025-12-04T09:58:55.9460668Z 2025-12-04T09:58:55.9460739Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9460894Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9460896Z 2025-12-04T09:58:55.9460978Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9461242Z FAILED [4.8224s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp2um3xr6u/flex_attention_configs.json was not created 2025-12-04T09:58:55.9461246Z 2025-12-04T09:58:55.9461317Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9461471Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9461473Z 2025-12-04T09:58:55.9461557Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9461822Z FAILED [4.6622s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpp3c9nvxc/flex_attention_configs.json was not created 2025-12-04T09:58:55.9461823Z 2025-12-04T09:58:55.9461898Z To execute this test, run the following from the base repo dir: 2025-12-04T09:58:55.9462052Z PYTORCH_TEST_WITH_ROCM=1 python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T09:58:55.9462067Z 2025-12-04T09:58:55.9462150Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T09:58:55.9462221Z =================== 49 failed, 1 passed in 235.06s (0:03:55) =================== 2025-12-04T09:58:55.9462223Z 2025-12-04T09:58:55.9462396Z FINISHED PRINTING LOG FILE of inductor/test_flex_attention 1/4 (test/test-reports/inductor.test_flex_attention_1.4_583be521806f48fb_.log) 2025-12-04T09:58:55.9462399Z 2025-12-04T09:58:55.9462534Z Finished inductor/test_flex_attention 1/4 ... [2025-12-04 09:58:53.667144][5636354.173533631], took 4.05min 2025-12-04T09:58:55.9462774Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:58:55.9462896Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:58:55.9462994Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T09:58:55.9463047Z Uploading artifacts took 0.00 seconds 2025-12-04T09:58:55.9463099Z inductor/test_flex_attention 1/4 failed! 2025-12-04T09:58:55.9463191Z Running inductor/test_halide 1/1 ... [2025-12-04 09:58:53.756562][5636354.262960601] 2025-12-04T09:58:55.9463239Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:58:55.9463601Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_halide.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:58:53.756892] 2025-12-04T09:58:59.4625326Z 2025-12-04T09:58:59.4626488Z inductor/test_halide 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_halide_1.1_be48a5c8fac04e84_.log 2025-12-04T09:58:59.4627268Z 2025-12-04T09:58:59.4627677Z Finished inductor/test_halide 1/1 ... [2025-12-04 09:58:59.462100][5636359.968501531], took 0.10min 2025-12-04T09:58:59.4631295Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:58:59.5498510Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:58:59.5506132Z Running inductor/test_compile_subprocess 2/3 ... [2025-12-04 09:58:59.550311][5636360.056709021] 2025-12-04T09:58:59.5506811Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:58:59.5508578Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_compile_subprocess.py', '--shard-id=2', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:58:59.550621] 2025-12-04T09:59:39.8505065Z 2025-12-04T09:59:39.8506736Z inductor/test_compile_subprocess 2/3 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_compile_subprocess_2.3_b31b3be84da8917a_.log 2025-12-04T09:59:39.8527535Z Running 50 items in this shard: test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, 
test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, 
test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_compile_subprocess.py::GPUTests::test_dropout_deterministic_cuda 2025-12-04T09:59:39.8546820Z 2025-12-04T09:59:39.8547269Z Finished inductor/test_compile_subprocess 2/3 ... [2025-12-04 09:59:39.851427][5636400.357828772], took 0.67min 2025-12-04T09:59:39.8548626Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:59:39.9386617Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:59:39.9396578Z Running inductor/test_deterministic 4/4 ... [2025-12-04 09:59:39.939171][5636400.445570367] 2025-12-04T09:59:39.9397257Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:59:39.9398960Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_deterministic.py', '--shard-id=4', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 09:59:39.939489] 2025-12-04T09:59:45.3920480Z 2025-12-04T09:59:45.3921890Z inductor/test_deterministic 4/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_deterministic_4.4_264a18eadf0999f5_.log 2025-12-04T09:59:45.3923537Z Running 0 items in this shard: 2025-12-04T09:59:45.3923795Z 2025-12-04T09:59:45.3924216Z Finished inductor/test_deterministic 4/4 ... [2025-12-04 09:59:45.391580][5636405.897978366], took 0.09min 2025-12-04T09:59:45.3930437Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:59:45.4799009Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:59:45.4807919Z Running export/test_functionalized_assertions 1/1 ... [2025-12-04 09:59:45.480643][5636405.987039126] 2025-12-04T09:59:45.4808624Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:59:45.4814010Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_functionalized_assertions.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:59:45.480987] 2025-12-04T09:59:47.2317531Z 2025-12-04T09:59:47.2319025Z export/test_functionalized_assertions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_functionalized_assertions_1.1_e3469c6fa3276c69_.log 2025-12-04T09:59:47.2320158Z Running 0 items in this shard: 2025-12-04T09:59:47.2320413Z 2025-12-04T09:59:47.2321561Z Finished export/test_functionalized_assertions 1/1 ... 
[2025-12-04 09:59:47.231417][5636407.73781643], took 0.03min 2025-12-04T09:59:47.2328467Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T09:59:47.3230084Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T09:59:47.3237995Z Running inductor/test_loop_ordering 1/1 ... [2025-12-04 09:59:47.323487][5636407.829884729] 2025-12-04T09:59:47.3238678Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T09:59:47.3240662Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_loop_ordering.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:59:47.323829] 2025-12-04T10:00:21.8753699Z 2025-12-04T10:00:21.8755130Z inductor/test_loop_ordering 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_loop_ordering_1.1_16860e7c04ca5666_.log 2025-12-04T10:00:21.8777651Z Running 50 items in this shard: test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, 
test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, 
test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template, test/inductor/test_loop_ordering.py::LoopOrderingTest::test_interaction_with_triton_template 
2025-12-04T10:00:21.8799659Z 2025-12-04T10:00:21.8800072Z Finished inductor/test_loop_ordering 1/1 ... [2025-12-04 10:00:21.874954][5636442.381356509], took 0.58min 2025-12-04T10:00:21.8801425Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:00:21.9626701Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:00:21.9631593Z Running export/test_serialize 1/1 ... [2025-12-04 10:00:21.962969][5636442.469366992] 2025-12-04T10:00:21.9632218Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:00:21.9636953Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_serialize.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:00:21.963317] 2025-12-04T10:00:27.3304875Z 2025-12-04T10:00:27.3306462Z export/test_serialize 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_serialize_1.1_ee01c89353795448_.log 2025-12-04T10:00:27.3307371Z Running 0 items in this shard: 2025-12-04T10:00:27.3307586Z 2025-12-04T10:00:27.3307906Z Finished export/test_serialize 1/1 ... [2025-12-04 10:00:27.330100][5636447.836502201], took 0.09min 2025-12-04T10:00:27.3312685Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:00:27.4187383Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:00:27.4194249Z Running inductor/test_cutedsl_template 1/1 ... 
[2025-12-04 10:00:27.419255][5636447.925651964] 2025-12-04T10:00:27.4194960Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:00:27.4198171Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cutedsl_template.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:00:27.419607] 2025-12-04T10:00:32.7075852Z 2025-12-04T10:00:32.7078076Z inductor/test_cutedsl_template 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cutedsl_template_1.1_dcc996aefdec823b_.log 2025-12-04T10:00:32.7079248Z Running 0 items in this shard: 2025-12-04T10:00:32.7079505Z 2025-12-04T10:00:32.7079920Z Finished inductor/test_cutedsl_template 1/1 ... [2025-12-04 10:00:32.707189][5636453.21358857], took 0.09min 2025-12-04T10:00:32.7086432Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:00:32.7991993Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:00:32.7999836Z Running inductor/test_benchmark_fusion 1/1 ... [2025-12-04 10:00:32.799556][5636453.305953358] 2025-12-04T10:00:32.8000552Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:00:32.8002132Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_benchmark_fusion.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:00:32.799877] 2025-12-04T10:04:38.8421605Z 2025-12-04T10:04:38.8422810Z PRINTING LOG FILE of inductor/test_benchmark_fusion 1/1 (test/test-reports/inductor.test_benchmark_fusion_1.1_ea0e2b7da1ec2de3_.log) 2025-12-04T10:04:38.8424271Z Test results will be stored in test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-32f423bfb0824e63.xml 2025-12-04T10:04:38.8425212Z ============================= test session starts ============================== 2025-12-04T10:04:38.8426067Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T10:04:38.8426702Z cachedir: .pytest_cache 2025-12-04T10:04:38.8427467Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T10:04:38.8429040Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T10:04:38.8429467Z configfile: pytest.ini 2025-12-04T10:04:38.8430467Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T10:04:38.8431309Z collecting ... 
collected 16 items 2025-12-04T10:04:38.8431782Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T10:04:38.8479768Z Running 100 items in this shard: test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, 
test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, 
test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, 
test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, 
test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, 
test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code 2025-12-04T10:04:38.8526652Z 2025-12-04T10:04:38.8527318Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [4.9296s] [ 1%] 2025-12-04T10:04:38.8529100Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:00:43.561000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8531203Z E1204 10:00:43.561000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8532730Z E1204 10:00:43.561000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8533764Z E1204 10:00:43.599000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8535332Z E1204 10:00:43.599000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8536929Z E1204 10:00:43.599000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T10:04:38.8537601Z PASSED [1.8841s] [ 2%] 2025-12-04T10:04:38.8538516Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7552s] [ 2%] 2025-12-04T10:04:38.8539910Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7538s] [ 2%] 2025-12-04T10:04:38.8541282Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7129s] [ 2%] 2025-12-04T10:04:38.8542699Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.9829s] [ 2%] 2025-12-04T10:04:38.8544065Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7028s] [ 2%] 2025-12-04T10:04:38.8545429Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6932s] [ 2%] 2025-12-04T10:04:38.8546910Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7344s] [ 2%] 2025-12-04T10:04:38.8548276Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6475s] [ 2%] 2025-12-04T10:04:38.8549651Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6275s] [ 2%] 2025-12-04T10:04:38.8551015Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.5790s] [ 2%] 2025-12-04T10:04:38.8552375Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- 
test/inductor/test_torchinductor.py PASSED [2.7388s] [ 2%] 2025-12-04T10:04:38.8553810Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [3.2588s] [ 2%] 2025-12-04T10:04:38.8555170Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7313s] [ 2%] 2025-12-04T10:04:38.8556641Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6835s] [ 2%] 2025-12-04T10:04:38.8557998Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7220s] [ 2%] 2025-12-04T10:04:38.8559365Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7176s] [ 2%] 2025-12-04T10:04:38.8560718Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.8120s] [ 2%] 2025-12-04T10:04:38.8562080Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.8110s] [ 2%] 2025-12-04T10:04:38.8563439Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7599s] [ 2%] 2025-12-04T10:04:38.8564804Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7709s] [ 2%] 2025-12-04T10:04:38.8566214Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [3.1361s] [ 2%] 2025-12-04T10:04:38.8567580Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- 
test/inductor/test_torchinductor.py PASSED [2.7188s] [ 2%] 2025-12-04T10:04:38.8568934Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [3.1013s] [ 2%] 2025-12-04T10:04:38.8570339Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7529s] [ 2%] 2025-12-04T10:04:38.8571696Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7758s] [ 2%] 2025-12-04T10:04:38.8573074Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7986s] [ 2%] 2025-12-04T10:04:38.8574471Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7475s] [ 2%] 2025-12-04T10:04:38.8575824Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7164s] [ 2%] 2025-12-04T10:04:38.8577241Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6874s] [ 2%] 2025-12-04T10:04:38.8578594Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [3.0778s] [ 2%] 2025-12-04T10:04:38.8579953Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7321s] [ 2%] 2025-12-04T10:04:38.8581306Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6974s] [ 2%] 2025-12-04T10:04:38.8582656Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- 
test/inductor/test_torchinductor.py PASSED [2.6892s] [ 2%] 2025-12-04T10:04:38.8584009Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6826s] [ 2%] 2025-12-04T10:04:38.8585427Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.8594s] [ 2%] 2025-12-04T10:04:38.8586876Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.6621s] [ 2%] 2025-12-04T10:04:38.8588226Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.8075s] [ 2%] 2025-12-04T10:04:38.8589579Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7879s] [ 2%] 2025-12-04T10:04:38.8590931Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7441s] [ 2%] 2025-12-04T10:04:38.8592282Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [3.1176s] [ 2%] 2025-12-04T10:04:38.8593647Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7256s] [ 2%] 2025-12-04T10:04:38.8595022Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7461s] [ 2%] 2025-12-04T10:04:38.8596417Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7468s] [ 2%] 2025-12-04T10:04:38.8597786Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- 
test/inductor/test_torchinductor.py PASSED [2.7067s] [ 2%] 2025-12-04T10:04:38.8599160Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7938s] [ 2%] 2025-12-04T10:04:38.8600519Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.7698s] [ 2%] 2025-12-04T10:04:38.8601919Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.9942s] [ 2%] 2025-12-04T10:04:38.8603283Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [2.8341s] [ 2%] 2025-12-04T10:04:38.8604679Z inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda <- test/inductor/test_torchinductor.py PASSED [3.1365s] [ 2%] 2025-12-04T10:04:38.8606446Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:02.344000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8608514Z E1204 10:03:02.344000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8610030Z E1204 10:03:02.344000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8611044Z E1204 10:03:02.386000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8612608Z E1204 10:03:02.386000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8614099Z E1204 10:03:02.386000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8615119Z W1204 10:03:04.150000 244705 site-packages/torch/_inductor/utils.py:1361] on error, temporary cache dir kept at /tmp/tmple4d_89u 2025-12-04T10:04:38.8615849Z FAILED [2.5344s] [ 2%] 2025-12-04T10:04:38.8617113Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:04.850000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8619195Z E1204 10:03:04.850000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8620683Z E1204 10:03:04.850000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8621685Z E1204 10:03:04.890000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8623242Z E1204 10:03:04.890000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8624735Z E1204 10:03:04.890000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T10:04:38.8625397Z PASSED [1.7499s] [ 2%] 2025-12-04T10:04:38.8626614Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:06.589000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8628652Z E1204 10:03:06.589000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8630148Z E1204 10:03:06.589000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8631192Z E1204 10:03:06.628000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8632737Z E1204 10:03:06.628000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8634221Z E1204 10:03:06.628000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8634883Z PASSED [1.6735s] [ 2%] 2025-12-04T10:04:38.8636152Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:08.310000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8638204Z E1204 10:03:08.310000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8639690Z E1204 10:03:08.310000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8640693Z E1204 10:03:08.348000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8642248Z E1204 10:03:08.348000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8643730Z E1204 10:03:08.348000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8644384Z PASSED [1.6411s] [ 2%] 2025-12-04T10:04:38.8645579Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:09.936000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8647666Z E1204 10:03:09.936000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8649201Z E1204 10:03:09.936000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8650209Z E1204 10:03:09.973000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8651752Z E1204 10:03:09.973000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8653236Z E1204 10:03:09.973000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8653896Z PASSED [1.6900s] [ 2%] 2025-12-04T10:04:38.8655048Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:11.619000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8657142Z E1204 10:03:11.619000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8658623Z E1204 10:03:11.619000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8659624Z E1204 10:03:11.657000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8661173Z E1204 10:03:11.657000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8662712Z E1204 10:03:11.657000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8663367Z PASSED [1.8498s] [ 2%] 2025-12-04T10:04:38.8664593Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:13.886000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8666701Z E1204 10:03:13.886000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8668187Z E1204 10:03:13.886000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8669196Z E1204 10:03:13.928000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8670747Z E1204 10:03:13.928000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8672235Z E1204 10:03:13.928000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8672896Z PASSED [2.1801s] [ 2%] 2025-12-04T10:04:38.8674053Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:15.665000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8676217Z E1204 10:03:15.665000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8677739Z E1204 10:03:15.665000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8678736Z E1204 10:03:15.702000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8680283Z E1204 10:03:15.702000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8681767Z E1204 10:03:15.702000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8682423Z PASSED [1.6782s] [ 2%] 2025-12-04T10:04:38.8683589Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:17.324000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8685631Z E1204 10:03:17.324000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8687187Z E1204 10:03:17.324000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8688204Z E1204 10:03:17.359000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8689761Z E1204 10:03:17.359000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8691302Z E1204 10:03:17.359000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8691963Z PASSED [1.6276s] [ 2%] 2025-12-04T10:04:38.8693126Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:18.954000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8695202Z E1204 10:03:18.954000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8696737Z E1204 10:03:18.954000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8697741Z E1204 10:03:18.991000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8699294Z E1204 10:03:18.991000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8700782Z E1204 10:03:18.991000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8701435Z PASSED [1.6187s] [ 2%] 2025-12-04T10:04:38.8702601Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:20.624000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8704648Z E1204 10:03:20.624000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8706236Z E1204 10:03:20.624000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8707242Z E1204 10:03:20.664000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8708834Z E1204 10:03:20.664000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8710332Z E1204 10:03:20.664000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8711002Z PASSED [1.8117s] [ 2%] 2025-12-04T10:04:38.8712159Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:22.415000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8714207Z E1204 10:03:22.415000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8715707Z E1204 10:03:22.415000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8716761Z E1204 10:03:22.454000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8718329Z E1204 10:03:22.454000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8719813Z E1204 10:03:22.454000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8720471Z PASSED [1.7372s] [ 2%] 2025-12-04T10:04:38.8721674Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:24.111000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8723717Z E1204 10:03:24.111000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8725251Z E1204 10:03:24.111000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8726298Z E1204 10:03:24.146000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8727857Z E1204 10:03:24.146000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8729355Z E1204 10:03:24.146000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8730015Z PASSED [1.6725s] [ 2%] 2025-12-04T10:04:38.8731190Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:25.811000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8733239Z E1204 10:03:25.811000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8734727Z E1204 10:03:25.811000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8735734Z E1204 10:03:25.850000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8737375Z E1204 10:03:25.850000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8738915Z E1204 10:03:25.850000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8739573Z PASSED [2.2587s] [ 2%] 2025-12-04T10:04:38.8740732Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:28.072000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8742780Z E1204 10:03:28.072000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8744283Z E1204 10:03:28.072000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8745288Z E1204 10:03:28.114000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8746929Z E1204 10:03:28.114000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8748435Z E1204 10:03:28.114000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T10:04:38.8749456Z W1204 10:03:29.651000 244705 site-packages/torch/_inductor/utils.py:1361] on error, temporary cache dir kept at /tmp/tmpral48sit 2025-12-04T10:04:38.8750179Z FAILED [2.2923s] [ 2%] 2025-12-04T10:04:38.8751377Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:30.359000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8753424Z E1204 10:03:30.359000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8754949Z E1204 10:03:30.359000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8755997Z E1204 10:03:30.399000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8757556Z E1204 10:03:30.399000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8759053Z E1204 10:03:30.399000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8759715Z PASSED [1.6932s] [ 2%] 2025-12-04T10:04:38.8760873Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:32.083000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8762949Z E1204 10:03:32.083000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8764435Z E1204 10:03:32.083000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8765432Z E1204 10:03:32.122000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8767070Z E1204 10:03:32.122000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8768596Z E1204 10:03:32.122000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8769255Z PASSED [1.6931s] [ 2%] 2025-12-04T10:04:38.8770422Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:33.755000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8772469Z E1204 10:03:33.755000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8773978Z E1204 10:03:33.755000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8774983Z E1204 10:03:33.790000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8776606Z E1204 10:03:33.790000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8797846Z E1204 10:03:33.790000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8798345Z PASSED [1.8013s] [ 2%] 2025-12-04T10:04:38.8799109Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:35.496000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8800582Z E1204 10:03:35.496000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8801528Z E1204 10:03:35.496000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8802160Z E1204 10:03:35.532000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8803176Z E1204 10:03:35.532000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8804112Z E1204 10:03:35.532000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8804534Z PASSED [1.8785s] [ 2%] 2025-12-04T10:04:38.8805273Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:37.489000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8806604Z E1204 10:03:37.489000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8807538Z E1204 10:03:37.489000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8808169Z E1204 10:03:37.529000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8809176Z E1204 10:03:37.529000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8810109Z E1204 10:03:37.529000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8810561Z PASSED [1.9664s] [ 2%] 2025-12-04T10:04:38.8811289Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:39.436000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8812575Z E1204 10:03:39.436000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8813506Z E1204 10:03:39.436000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8814144Z E1204 10:03:39.473000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8815120Z E1204 10:03:39.473000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8816113Z E1204 10:03:39.473000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8816527Z PASSED [1.7173s] [ 2%] 2025-12-04T10:04:38.8817262Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:41.129000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8818545Z E1204 10:03:41.129000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8819512Z E1204 10:03:41.129000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8820145Z E1204 10:03:41.165000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8821139Z E1204 10:03:41.165000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8822070Z E1204 10:03:41.165000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8822484Z PASSED [2.2398s] [ 2%] 2025-12-04T10:04:38.8823210Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:43.452000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8824491Z E1204 10:03:43.452000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8825429Z E1204 10:03:43.452000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8826111Z E1204 10:03:43.487000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8827085Z E1204 10:03:43.487000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8828024Z E1204 10:03:43.487000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8828469Z PASSED [1.7709s] [ 2%] 2025-12-04T10:04:38.8829202Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:45.157000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8830516Z E1204 10:03:45.157000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8831457Z E1204 10:03:45.157000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8832090Z E1204 10:03:45.192000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8833064Z E1204 10:03:45.192000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8834001Z E1204 10:03:45.192000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8834412Z PASSED [1.7008s] [ 2%] 2025-12-04T10:04:38.8835153Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:46.946000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8836484Z E1204 10:03:46.946000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8837423Z E1204 10:03:46.946000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8838085Z E1204 10:03:46.983000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8839059Z E1204 10:03:46.983000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8839994Z E1204 10:03:46.983000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8840432Z PASSED [1.8383s] [ 2%] 2025-12-04T10:04:38.8841162Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:48.840000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8842455Z E1204 10:03:48.840000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8843402Z E1204 10:03:48.840000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8844031Z E1204 10:03:48.880000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8845017Z E1204 10:03:48.880000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8846019Z E1204 10:03:48.880000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8846431Z PASSED [2.5380s] [ 2%] 2025-12-04T10:04:38.8847187Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:51.263000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8848503Z E1204 10:03:51.263000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8849436Z E1204 10:03:51.263000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8850064Z E1204 10:03:51.301000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8851036Z E1204 10:03:51.301000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8851978Z E1204 10:03:51.301000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8852392Z PASSED [2.0145s] [ 2%] 2025-12-04T10:04:38.8853119Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:53.245000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8854403Z E1204 10:03:53.245000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8855338Z E1204 10:03:53.245000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8856016Z E1204 10:03:53.283000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8857016Z E1204 10:03:53.283000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8857955Z E1204 10:03:53.283000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8858367Z PASSED [2.1332s] [ 2%] 2025-12-04T10:04:38.8859121Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:55.392000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8860411Z E1204 10:03:55.392000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8861351Z E1204 10:03:55.392000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8861982Z E1204 10:03:55.430000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8862958Z E1204 10:03:55.430000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8863893Z E1204 10:03:55.430000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8864307Z PASSED [2.0704s] [ 2%] 2025-12-04T10:04:38.8865039Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:57.471000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8866413Z E1204 10:03:57.471000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8867390Z E1204 10:03:57.471000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8868019Z E1204 10:03:57.511000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8868993Z E1204 10:03:57.511000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8869926Z E1204 10:03:57.511000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8870339Z PASSED [2.4208s] [ 2%] 2025-12-04T10:04:38.8871071Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:03:59.877000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8872359Z E1204 10:03:59.877000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8873295Z E1204 10:03:59.877000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8873922Z E1204 10:03:59.925000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8874895Z E1204 10:03:59.925000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8875875Z E1204 10:03:59.925000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8876399Z PASSED [1.7615s] [ 2%] 2025-12-04T10:04:38.8877130Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:01.646000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8878442Z E1204 10:04:01.646000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8879380Z E1204 10:04:01.646000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8880011Z E1204 10:04:01.685000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8880987Z E1204 10:04:01.685000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8881924Z E1204 10:04:01.685000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8882339Z PASSED [1.7958s] [ 2%] 2025-12-04T10:04:38.8883074Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:03.406000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8884368Z E1204 10:04:03.406000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8885339Z E1204 10:04:03.406000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8886071Z E1204 10:04:03.444000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8887049Z E1204 10:04:03.444000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8887986Z E1204 10:04:03.444000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8888400Z PASSED [1.7537s] [ 2%] 2025-12-04T10:04:38.8889143Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:05.207000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8890434Z E1204 10:04:05.207000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8891372Z E1204 10:04:05.207000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8892002Z E1204 10:04:05.247000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8892976Z E1204 10:04:05.247000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8893918Z E1204 10:04:05.247000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8894385Z PASSED [1.7033s] [ 2%] 2025-12-04T10:04:38.8895113Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:06.908000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8896457Z E1204 10:04:06.908000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8897439Z E1204 10:04:06.908000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8898069Z E1204 10:04:06.947000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8899046Z E1204 10:04:06.947000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8899987Z E1204 10:04:06.947000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8900401Z PASSED [1.8319s] [ 2%] 2025-12-04T10:04:38.8901130Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:08.743000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8902415Z E1204 10:04:08.743000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8903349Z E1204 10:04:08.743000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8904015Z E1204 10:04:08.785000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8905025Z E1204 10:04:08.785000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8906010Z E1204 10:04:08.785000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8906428Z PASSED [1.7531s] [ 2%] 2025-12-04T10:04:38.8907175Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:10.498000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8908468Z E1204 10:04:10.498000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8909415Z E1204 10:04:10.498000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8910053Z E1204 10:04:10.538000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8911041Z E1204 10:04:10.538000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8911986Z E1204 10:04:10.538000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8912401Z PASSED [1.7216s] [ 2%] 2025-12-04T10:04:38.8913133Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:12.207000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8914447Z E1204 10:04:12.207000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8915385Z E1204 10:04:12.207000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8916085Z E1204 10:04:12.247000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8917061Z E1204 10:04:12.247000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8917997Z E1204 10:04:12.247000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8918415Z PASSED [1.8487s] [ 2%] 2025-12-04T10:04:38.8919146Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:14.641000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8920426Z E1204 10:04:14.641000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8921359Z E1204 10:04:14.641000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8921985Z E1204 10:04:14.686000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8922990Z E1204 10:04:14.686000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8923953Z E1204 10:04:14.686000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8924371Z PASSED [2.2702s] [ 2%] 2025-12-04T10:04:38.8925101Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:16.305000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8926428Z E1204 10:04:16.305000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8927361Z E1204 10:04:16.305000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8927991Z E1204 10:04:16.345000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8928964Z E1204 10:04:16.345000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8929903Z E1204 10:04:16.345000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8930318Z PASSED [1.8459s] [ 2%] 2025-12-04T10:04:38.8931046Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:18.187000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8932329Z E1204 10:04:18.187000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8933293Z E1204 10:04:18.187000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8933918Z E1204 10:04:18.227000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8934911Z E1204 10:04:18.227000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8935845Z E1204 10:04:18.227000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8936320Z PASSED [1.7492s] [ 2%] 2025-12-04T10:04:38.8937064Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:19.943000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8938353Z E1204 10:04:19.943000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8939286Z E1204 10:04:19.943000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8939913Z E1204 10:04:19.982000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8940892Z E1204 10:04:19.982000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8941876Z E1204 10:04:19.982000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8942543Z W1204 10:04:21.368000 244705 site-packages/torch/_inductor/utils.py:1361] on error, temporary cache dir kept at /tmp/tmpn_rptzwz 2025-12-04T10:04:38.8943000Z FAILED [2.1570s] [ 2%] 2025-12-04T10:04:38.8943732Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:22.105000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8945015Z E1204 10:04:22.105000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8946000Z E1204 10:04:22.105000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8946639Z E1204 10:04:22.144000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8947617Z E1204 10:04:22.144000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8948552Z E1204 10:04:22.144000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 
2025-12-04T10:04:38.8948970Z PASSED [1.8544s] [ 2%] 2025-12-04T10:04:38.8949698Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:23.955000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8950984Z E1204 10:04:23.955000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8951950Z E1204 10:04:23.955000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8952576Z E1204 10:04:23.994000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8953584Z E1204 10:04:23.994000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8954523Z E1204 10:04:23.994000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8954937Z PASSED [1.7525s] [ 2%] 2025-12-04T10:04:38.8955670Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:25.718000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8957040Z E1204 10:04:25.718000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8957978Z E1204 10:04:25.718000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8958611Z E1204 10:04:25.760000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8959586Z E1204 10:04:25.760000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8960554Z E1204 10:04:25.760000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8960968Z PASSED [2.2766s] [ 2%] 2025-12-04T10:04:38.8961727Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:27.980000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8963014Z E1204 10:04:27.980000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8963949Z E1204 10:04:27.980000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8964575Z E1204 10:04:28.017000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8965558Z E1204 10:04:28.017000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8966541Z E1204 10:04:28.017000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8966957Z PASSED [1.6684s] [ 2%] 2025-12-04T10:04:38.8967335Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:30.157000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8967954Z E1204 10:04:30.157000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8968422Z E1204 10:04:30.157000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8968727Z E1204 10:04:30.198000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8969197Z E1204 10:04:30.198000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8969663Z E1204 10:04:30.198000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8969867Z PASSED [2.2561s] [ 2%] 2025-12-04T10:04:38.8970223Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:31.913000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8970840Z E1204 10:04:31.913000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. 
OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8971292Z E1204 10:04:31.913000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8971596Z E1204 10:04:31.953000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8972069Z E1204 10:04:31.953000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8972526Z E1204 10:04:31.953000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8972729Z PASSED [1.7739s] [ 2%] 2025-12-04T10:04:38.8973130Z inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code E1204 10:04:33.679000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8973771Z E1204 10:04:33.679000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 2025-12-04T10:04:38.8974229Z E1204 10:04:33.679000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8974541Z E1204 10:04:33.719000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Runtime error during autotuning: 2025-12-04T10:04:38.8975015Z E1204 10:04:33.719000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 65536 Hardware limit:65536 Reducing block sizes or `num_stages` may help.. 
2025-12-04T10:04:38.8975474Z E1204 10:04:33.719000 244705 site-packages/torch/_inductor/select_algorithm.py:3696] [0/0] Ignoring this choice. 2025-12-04T10:04:38.8975682Z PASSED [1.9475s] [ 2%] 2025-12-04T10:04:38.8975745Z 2025-12-04T10:04:38.8975809Z =================================== FAILURES =================================== 2025-12-04T10:04:38.8976065Z ______ BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code _______ 2025-12-04T10:04:38.8976259Z Traceback (most recent call last): 2025-12-04T10:04:38.8976507Z File "/var/lib/jenkins/pytorch/test/inductor/test_benchmark_fusion.py", line 303, in test_equivalent_template_code 2025-12-04T10:04:38.8976776Z ).check("" if config.cpp_wrapper else "return").run(out_code[0]) 2025-12-04T10:04:38.8976988Z RuntimeError: Expected to find "triton_tem_fused_addmm_relu_t_0" but did not find it 2025-12-04T10:04:38.8977166Z Searched string: 2025-12-04T10:04:38.8977279Z with torch.cuda._DeviceGuard(0): 2025-12-04T10:04:38.8977429Z torch.cuda.set_device(0) 2025-12-04T10:04:38.8977583Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float16) 2025-12-04T10:04:38.8977789Z # Topologically Sorted Source Nodes: [a], Original ATen: [aten.t, aten.addmm] 2025-12-04T10:04:38.8977970Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.8978139Z triton_tem_fused_addmm_t_0.run(arg2_1, arg0_1, buf0, 64, 1, 1, stream=stream0) 2025-12-04T10:04:38.8978304Z del arg0_1 2025-12-04T10:04:38.8978404Z del arg2_1 2025-12-04T10:04:38.8978532Z buf1 = buf0; del buf0 # reuse 2025-12-04T10:04:38.8978719Z # Topologically Sorted Source Nodes: [a, relu], Original ATen: [aten.addmm, aten.relu] 2025-12-04T10:04:38.8978902Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.8979064Z triton_poi_fused_addmm_relu_1.run(buf1, arg1_1, 65536, stream=stream0) 2025-12-04T10:04:38.8979233Z del arg1_1 2025-12-04T10:04:38.8979338Z return (buf1, ) 2025-12-04T10:04:38.8979409Z 2025-12-04T10:04:38.8979459Z runner = Runner(partitions=[]) 2025-12-04T10:04:38.8979578Z 
call = runner.call 2025-12-04T10:04:38.8979718Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T10:04:38.8979824Z 2025-12-04T10:04:38.8979826Z 2025-12-04T10:04:38.8979890Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T10:04:38.8980051Z from torch._dynamo.testing import rand_strided 2025-12-04T10:04:38.8980211Z from torch._inductor.utils import print_performance 2025-12-04T10:04:38.8980402Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.8980607Z arg1_1 = rand_strided((256, ), (1, ), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.8980809Z arg2_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.8980985Z fn = lambda: call([arg0_1, arg1_1, arg2_1]) 2025-12-04T10:04:38.8981150Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T10:04:38.8981276Z 2025-12-04T10:04:38.8981278Z 2025-12-04T10:04:38.8981324Z if __name__ == "__main__": 2025-12-04T10:04:38.8981500Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T10:04:38.8981693Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T10:04:38.8981851Z From CHECK: triton_tem_fused_addmm_relu_t_0 2025-12-04T10:04:38.8981938Z 2025-12-04T10:04:38.8981944Z 2025-12-04T10:04:38.8982019Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:38.8982320Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code 2025-12-04T10:04:38.8982546Z 2025-12-04T10:04:38.8982637Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:38.8982842Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.8983007Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.8983142Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.8983345Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.8984107Z inductor [('triton_bundler_save_kernel', 448), ('benchmarking.InductorBenchmarker.benchmark_gpu', 52), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 16), ('coordesc_tuning_bench', 7), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.8984807Z graph_break [] 2025-12-04T10:04:38.8984916Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.8985081Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.8985696Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:04:38.8986293Z current_size = base.storage().size() 2025-12-04T10:04:38.8986421Z Autotune Choices Stats: 2025-12-04T10:04:38.8986898Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_7", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.8987348Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.8987465Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.8987585Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.8987911Z triton_mm_7 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.8988421Z triton_mm_11 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.8988926Z triton_mm_8 0.0063 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.8989441Z triton_mm_6 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.8989949Z triton_mm_12 0.0067 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.8990473Z triton_mm_5 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.8990979Z triton_mm_9 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.8991488Z triton_mm_22 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.8991993Z triton_mm_13 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.8992497Z triton_mm_19 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.8992906Z SingleProcess AUTOTUNE benchmarking takes 0.2508 seconds and 0.3321 seconds precompiling for 36 choices 2025-12-04T10:04:38.8993215Z Compiled module path: /tmp/tmphbgk0pon/w4/cw4f4q7u3v7sk4jykcksc6x23v3vsehvxi7p7oiyzgltxtm22ko6.py 2025-12-04T10:04:38.8993531Z Compiled module path: /tmp/tmphbgk0pon/bf/cbfsx2nqaiielw4hd7yslddciqzngmnd2m2d4ae552nnm7wiy5f5.py 2025-12-04T10:04:38.8993799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.8993957Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.8994095Z stats [('calls_captured', 4), ('unique_graphs', 2)] 
2025-12-04T10:04:38.8994294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.8995103Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 47), ('benchmarking.InductorBenchmarker.benchmark', 37), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('coordesc_tuning_bench', 23), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1)] 2025-12-04T10:04:38.8995832Z graph_break [] 2025-12-04T10:04:38.8995984Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.8996146Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.8996303Z Autotune Choices Stats: 2025-12-04T10:04:38.8996743Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_77", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T10:04:38.8997198Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.8997312Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.8997428Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.8997749Z triton_mm_77 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.8998284Z triton_mm_84 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.8998804Z triton_mm_88 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.8999310Z triton_mm_81 0.0069 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.8999813Z triton_mm_79 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9000320Z triton_mm_83 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9000827Z triton_mm_85 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9001334Z triton_mm_91 0.0070 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9001841Z triton_mm_80 0.0071 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9002364Z triton_mm_90 0.0071 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9002771Z SingleProcess AUTOTUNE benchmarking takes 0.2675 seconds and 0.2612 seconds precompiling for 36 choices 2025-12-04T10:04:38.9003075Z Compiled module path: /tmp/tmple4d_89u/ba/cbago5xr5v2secbwecligdrfh4zzzsojjri3gfceyn2lclaua3wz.py 2025-12-04T10:04:38.9003402Z Compiled module path: /tmp/tmple4d_89u/bw/cbw6qbjuvv6m3ss7vtktie22ioyr2mwuvshkuutjpfcz3uoyzrcj.py 2025-12-04T10:04:38.9003681Z ______ BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code _______ 2025-12-04T10:04:38.9003876Z Traceback (most recent call last): 2025-12-04T10:04:38.9004113Z File "/var/lib/jenkins/pytorch/test/inductor/test_benchmark_fusion.py", line 303, in test_equivalent_template_code 2025-12-04T10:04:38.9004379Z ).check("" if config.cpp_wrapper else "return").run(out_code[0]) 2025-12-04T10:04:38.9004590Z RuntimeError: Expected to find "triton_tem_fused_addmm_relu_t_0" but did not find it 2025-12-04T10:04:38.9004767Z Searched string: 2025-12-04T10:04:38.9004871Z with torch.cuda._DeviceGuard(0): 2025-12-04T10:04:38.9004997Z torch.cuda.set_device(0) 2025-12-04T10:04:38.9005145Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float16) 2025-12-04T10:04:38.9005349Z # Topologically Sorted Source Nodes: [a], Original ATen: [aten.t, aten.addmm] 2025-12-04T10:04:38.9005527Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9005695Z triton_tem_fused_addmm_t_0.run(arg2_1, arg0_1, buf0, 32, 1, 1, stream=stream0) 2025-12-04T10:04:38.9005861Z del arg0_1 2025-12-04T10:04:38.9006005Z del arg2_1 2025-12-04T10:04:38.9006112Z buf1 = buf0; del buf0 # reuse 2025-12-04T10:04:38.9006326Z # Topologically Sorted Source Nodes: [a, relu], Original ATen: [aten.addmm, aten.relu] 2025-12-04T10:04:38.9006508Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9006688Z triton_poi_fused_addmm_relu_1.run(buf1, arg1_1, 65536, 
stream=stream0) 2025-12-04T10:04:38.9006845Z del arg1_1 2025-12-04T10:04:38.9006947Z return (buf1, ) 2025-12-04T10:04:38.9007010Z 2025-12-04T10:04:38.9007056Z runner = Runner(partitions=[]) 2025-12-04T10:04:38.9007171Z call = runner.call 2025-12-04T10:04:38.9007299Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T10:04:38.9007400Z 2025-12-04T10:04:38.9007402Z 2025-12-04T10:04:38.9007466Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T10:04:38.9007621Z from torch._dynamo.testing import rand_strided 2025-12-04T10:04:38.9007778Z from torch._inductor.utils import print_performance 2025-12-04T10:04:38.9007964Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9008168Z arg1_1 = rand_strided((256, ), (1, ), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9008364Z arg2_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9008539Z fn = lambda: call([arg0_1, arg1_1, arg2_1]) 2025-12-04T10:04:38.9008695Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T10:04:38.9008798Z 2025-12-04T10:04:38.9008800Z 2025-12-04T10:04:38.9008844Z if __name__ == "__main__": 2025-12-04T10:04:38.9008993Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T10:04:38.9009183Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T10:04:38.9009340Z From CHECK: triton_tem_fused_addmm_relu_t_0 2025-12-04T10:04:38.9009425Z 2025-12-04T10:04:38.9009427Z 2025-12-04T10:04:38.9009503Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:38.9009799Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code 2025-12-04T10:04:38.9010039Z 2025-12-04T10:04:38.9010133Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:38.9010334Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:04:38.9010491Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9010619Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9010815Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9011583Z inductor [('triton_bundler_save_kernel', 448), ('benchmarking.InductorBenchmarker.benchmark_gpu', 52), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 16), ('coordesc_tuning_bench', 7), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9012267Z graph_break [] 2025-12-04T10:04:38.9012371Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9012528Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9013125Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:04:38.9013688Z current_size = base.storage().size() 2025-12-04T10:04:38.9013815Z Autotune Choices Stats: 2025-12-04T10:04:38.9014268Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_7", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9014729Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9014845Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9014962Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9015284Z triton_mm_7 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9015788Z triton_mm_11 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9016344Z triton_mm_8 0.0063 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9016844Z triton_mm_6 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9017348Z triton_mm_12 0.0067 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9017852Z triton_mm_5 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9018354Z triton_mm_9 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9018885Z triton_mm_22 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9019404Z triton_mm_13 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9019907Z triton_mm_19 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9020307Z SingleProcess AUTOTUNE benchmarking takes 0.2508 seconds and 0.3321 seconds precompiling for 36 choices 2025-12-04T10:04:38.9020614Z Compiled module path: /tmp/tmphbgk0pon/w4/cw4f4q7u3v7sk4jykcksc6x23v3vsehvxi7p7oiyzgltxtm22ko6.py 2025-12-04T10:04:38.9020928Z Compiled module path: /tmp/tmphbgk0pon/bf/cbfsx2nqaiielw4hd7yslddciqzngmnd2m2d4ae552nnm7wiy5f5.py 2025-12-04T10:04:38.9021180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9021334Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9021465Z stats [('calls_captured', 4), ('unique_graphs', 2)] 
2025-12-04T10:04:38.9021659Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9022457Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 47), ('benchmarking.InductorBenchmarker.benchmark', 37), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('coordesc_tuning_bench', 23), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1)] 2025-12-04T10:04:38.9023188Z graph_break [] 2025-12-04T10:04:38.9023292Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9023452Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9023604Z Autotune Choices Stats: 2025-12-04T10:04:38.9024036Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_77", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T10:04:38.9024482Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9024600Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9024719Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9025035Z triton_mm_77 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9025548Z triton_mm_84 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9026097Z triton_mm_88 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9026609Z triton_mm_81 0.0069 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9027132Z triton_mm_79 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9027646Z triton_mm_83 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9028145Z triton_mm_85 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9028646Z triton_mm_91 0.0070 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9029144Z triton_mm_80 0.0071 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9029646Z triton_mm_90 0.0071 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9030049Z SingleProcess AUTOTUNE benchmarking takes 0.2675 seconds and 0.2612 seconds precompiling for 36 choices 2025-12-04T10:04:38.9030349Z Compiled module path: /tmp/tmple4d_89u/ba/cbago5xr5v2secbwecligdrfh4zzzsojjri3gfceyn2lclaua3wz.py 2025-12-04T10:04:38.9030654Z Compiled module path: /tmp/tmple4d_89u/bw/cbw6qbjuvv6m3ss7vtktie22ioyr2mwuvshkuutjpfcz3uoyzrcj.py 2025-12-04T10:04:38.9030919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9031080Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9031229Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9031422Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9032178Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9032855Z graph_break [] 2025-12-04T10:04:38.9032962Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9033120Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9033277Z Autotune Choices Stats: 2025-12-04T10:04:38.9033719Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_156", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.00595899997279048, "best_triton_pos": 0} 2025-12-04T10:04:38.9034164Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9034278Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9034394Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9034714Z triton_mm_156 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9035240Z triton_mm_157 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9035747Z triton_mm_150 0.0065 ms 91.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9036317Z triton_mm_162 0.0066 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9036826Z triton_mm_148 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9037324Z triton_mm_152 0.0069 ms 86.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9037823Z triton_mm_146 0.0069 ms 86.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9038331Z triton_mm_163 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9038839Z triton_mm_166 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9039369Z triton_mm_160 0.0071 ms 83.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9039783Z SingleProcess AUTOTUNE benchmarking takes 0.2364 seconds and 0.2725 seconds precompiling for 36 choices 2025-12-04T10:04:38.9040093Z Compiled module path: /tmp/tmpujpayw3u/6r/c6rppdseuymkfljpvakni5ryayiqjhbdwwu4ki6zlrr63hvc72ie.py 2025-12-04T10:04:38.9040409Z Compiled module path: /tmp/tmpujpayw3u/ij/cijgyswvhqcajl4jrzaivb4upptrtyrigdkzwa5u5gmh64bpx2sa.py 2025-12-04T10:04:38.9040662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9040820Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9040950Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9041143Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9041897Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), 
('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9042582Z graph_break [] 2025-12-04T10:04:38.9042687Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9042846Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9043002Z Autotune Choices Stats: 2025-12-04T10:04:38.9043438Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_221", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.006440000142902136, "best_triton_pos": 0} 2025-12-04T10:04:38.9043903Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9044015Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9044131Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9044463Z triton_mm_221 0.0064 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9044976Z triton_mm_228 0.0065 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9045486Z triton_mm_227 0.0066 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9046040Z triton_mm_223 0.0066 ms 97.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9046547Z triton_mm_217 0.0068 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=32, BLOCK_N=16, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2 2025-12-04T10:04:38.9047050Z triton_mm_219 0.0068 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9047571Z triton_mm_222 0.0068 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9048094Z triton_mm_229 0.0069 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9048600Z triton_mm_224 0.0070 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9049100Z triton_mm_220 0.0070 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9049502Z SingleProcess AUTOTUNE benchmarking takes 0.2322 seconds and 0.2645 seconds precompiling for 36 choices 2025-12-04T10:04:38.9049807Z Compiled module path: /tmp/tmplqztcoos/c3/cc3vogulhiuqlbae6kejcpo5jtkrdbi67rnpmmy4y36lsooku3k7.py 
2025-12-04T10:04:38.9050119Z Compiled module path: /tmp/tmplqztcoos/m3/cm3qvnsjgpjjt5hgtdu5whbh22cnjz2xnkuhttjsedptbmlzsnci.py 2025-12-04T10:04:38.9050367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9050523Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9050656Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9050851Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9051597Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9052294Z graph_break [] 2025-12-04T10:04:38.9052398Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9052556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9052710Z Autotune Choices Stats: 2025-12-04T10:04:38.9053163Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_296", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006399999838322401, "best_triton_pos": 0} 2025-12-04T10:04:38.9053608Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9053723Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9053842Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9054161Z triton_mm_296 0.0064 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9054674Z triton_mm_295 0.0065 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9055180Z triton_mm_297 0.0066 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9055696Z triton_mm_299 0.0066 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9056245Z triton_mm_301 0.0066 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9056772Z triton_mm_306 0.0067 ms 95.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9057278Z triton_mm_293 0.0068 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9057786Z triton_mm_305 0.0069 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 
2025-12-04T10:04:38.9058293Z triton_mm_294 0.0069 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9058804Z triton_mm_309 0.0072 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9059206Z SingleProcess AUTOTUNE benchmarking takes 0.2584 seconds and 0.2697 seconds precompiling for 36 choices 2025-12-04T10:04:38.9059515Z Compiled module path: /tmp/tmpdomlhjox/mc/cmcwujd5cizd4scnlcqifabifo5v5umjcgkbbpca7sn7rh76b2pf.py 2025-12-04T10:04:38.9059831Z Compiled module path: /tmp/tmpdomlhjox/b7/cb7ygmw7hpgnsj5eivnjg2ahmxqkaxd2nxdhlybrabuzi3gablpe.py 2025-12-04T10:04:38.9060094Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9060255Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9060387Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9060584Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9061356Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9062041Z graph_break [] 2025-12-04T10:04:38.9062147Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9062302Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9062456Z Autotune Choices Stats: 2025-12-04T10:04:38.9062894Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_369", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.00583899999037385, "best_triton_pos": 0} 2025-12-04T10:04:38.9063351Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9063464Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9063581Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9063899Z triton_mm_369 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9064425Z triton_mm_366 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9064941Z triton_mm_379 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9065443Z triton_mm_367 0.0064 ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9065978Z triton_mm_373 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 
2025-12-04T10:04:38.9066484Z triton_mm_368 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9066990Z triton_mm_372 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9067502Z triton_mm_378 0.0069 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9068012Z triton_mm_382 0.0070 ms 83.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9068546Z triton_mm_377 0.0071 ms 82.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9068956Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.2696 seconds precompiling for 36 choices 2025-12-04T10:04:38.9069264Z Compiled module path: /tmp/tmprfi6eta4/jn/cjnxgjerka6xvgwxgssgtk4pjsfrtpl4q2zoeknjdhbqu4r7i3w2.py 2025-12-04T10:04:38.9069625Z Compiled module path: /tmp/tmprfi6eta4/jq/cjqcixbt64by5nhzholfgtbeoclbevma4msrl43l63jnlycw5kdc.py 2025-12-04T10:04:38.9069876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9070035Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9070168Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9070364Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9071130Z inductor [('triton_bundler_save_kernel', 456), ('benchmarking.InductorBenchmarker.benchmark_gpu', 53), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 17), ('coordesc_tuning_bench', 8), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9071816Z graph_break [] 2025-12-04T10:04:38.9071921Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9072078Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9072233Z Autotune Choices Stats: 2025-12-04T10:04:38.9072685Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_439", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.0060800001956522465, "best_triton_pos": 0} 2025-12-04T10:04:38.9073145Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9073258Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9073373Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9073693Z triton_mm_439 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9074200Z triton_mm_437 0.0062 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 
2025-12-04T10:04:38.9074711Z triton_mm_448 0.0069 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9075219Z triton_mm_449 0.0069 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9075727Z triton_mm_441 0.0070 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9076266Z triton_mm_444 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9076785Z triton_mm_438 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9077287Z triton_mm_440 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9077799Z triton_mm_443 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9078304Z triton_mm_450 0.0071 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9078711Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.2557 seconds precompiling for 36 choices 2025-12-04T10:04:38.9079013Z Compiled module path: /tmp/tmpiy_c81rd/4r/c4rkjb26sea3bukicooquz5gpwvhhyzs4ilgnntffmfie5m4e7vj.py 2025-12-04T10:04:38.9079317Z Compiled module path: /tmp/tmpiy_c81rd/7h/c7h5slbwz4bgun22psqksryrnky263lpgiwdpwilovbiegbv57da.py 2025-12-04T10:04:38.9079563Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9079720Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9079850Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9080044Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9080820Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9081515Z graph_break [] 2025-12-04T10:04:38.9081617Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9081776Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9081929Z Autotune Choices Stats: 2025-12-04T10:04:38.9082363Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_511", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, 
num_warps=4", "best_time": 0.006519999820739031, "best_triton_pos": 0} 2025-12-04T10:04:38.9082807Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9082921Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9083039Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9083360Z triton_mm_511 0.0065 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9083871Z triton_mm_520 0.0069 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9084373Z triton_mm_512 0.0070 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9084880Z triton_mm_510 0.0070 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9085399Z triton_mm_508 0.0076 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9085912Z triton_mm_513 0.0076 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9086462Z triton_mm_519 0.0076 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9086973Z triton_mm_517 0.0077 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9087481Z triton_mm_525 0.0077 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9087997Z triton_mm_523 0.0078 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9088400Z SingleProcess AUTOTUNE benchmarking takes 0.6616 seconds and 0.2698 seconds precompiling for 36 choices 2025-12-04T10:04:38.9088701Z Compiled module path: /tmp/tmpxt1vai9i/5w/c5wwu32cp233qqzv5dpouzaq4su75rkycnetna6tch46zq52fijk.py 2025-12-04T10:04:38.9089010Z Compiled module path: /tmp/tmpxt1vai9i/ul/cul5r4u6mig6mua5yzuigwwlhqkaevnlnnoz4omja5z45ojvaizf.py 2025-12-04T10:04:38.9089283Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9089456Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9089590Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9089785Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9090543Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), 
('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9091231Z graph_break [] 2025-12-04T10:04:38.9091339Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9091500Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9091657Z Autotune Choices Stats: 2025-12-04T10:04:38.9092094Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_595", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9092542Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9092655Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9092769Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9093091Z triton_mm_595 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9093620Z triton_mm_585 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9094127Z triton_mm_587 0.0064 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9094639Z triton_mm_584 0.0065 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9095140Z triton_mm_582 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9095643Z triton_mm_589 0.0066 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9096193Z triton_mm_594 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9096698Z triton_mm_592 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9097225Z triton_mm_593 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9097736Z triton_mm_598 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9098153Z SingleProcess AUTOTUNE benchmarking takes 0.2574 seconds and 0.2680 seconds precompiling for 36 choices 2025-12-04T10:04:38.9098455Z Compiled module path: /tmp/tmp6audcjbe/zk/czkdqv2xbklr4p4dpjneas4fzxmjzd6odn7aavh4nwxtaztq6ock.py 2025-12-04T10:04:38.9098768Z Compiled module path: 
/tmp/tmp6audcjbe/kl/cklltvqedtgj6pyk5aaedy3c774zevfwxq7oxy6o53pxikluowui.py 2025-12-04T10:04:38.9099022Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9099179Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9099309Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9099503Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9100255Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9100938Z graph_break [] 2025-12-04T10:04:38.9101046Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9101203Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9101356Z Autotune Choices Stats: 2025-12-04T10:04:38.9101791Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_660", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T10:04:38.9102254Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9102368Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9102484Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9102824Z triton_mm_660 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, 
BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9103337Z triton_mm_655 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9103844Z triton_mm_661 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9104346Z triton_mm_656 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9104850Z triton_mm_653 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9105361Z triton_mm_657 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9105900Z triton_mm_654 0.0066 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9106449Z triton_mm_659 0.0066 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9106957Z triton_mm_666 0.0066 ms 
89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9107460Z triton_mm_667 0.0067 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9107869Z SingleProcess AUTOTUNE benchmarking takes 0.2302 seconds and 0.2735 seconds precompiling for 36 choices 2025-12-04T10:04:38.9108167Z Compiled module path: /tmp/tmpql_vaqg4/h3/ch32lgymd5v7nxzv4j63jmv24mbl5ftg76yvqmuxyxoa3aquh4fm.py 2025-12-04T10:04:38.9108466Z Compiled module path: /tmp/tmpql_vaqg4/6d/c6dwrbsr74ysmtciajexo5cs5fl7tyufrrgp5merbbu3ljlmjcnq.py 2025-12-04T10:04:38.9108709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9108866Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9109000Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9109195Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9109945Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9110643Z graph_break [] 2025-12-04T10:04:38.9110746Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9110907Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T10:04:38.9111061Z Autotune Choices Stats: 2025-12-04T10:04:38.9111524Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_728", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T10:04:38.9111966Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9112081Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9112194Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9112514Z triton_mm_728 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9113025Z triton_mm_729 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9113537Z triton_mm_731 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9114059Z triton_mm_724 0.0064 ms 90.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9114576Z triton_mm_739 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9115084Z triton_mm_738 0.0067 ms 
86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9115599Z triton_mm_725 0.0068 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9116214Z triton_mm_742 0.0068 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9116723Z triton_mm_727 0.0068 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9117227Z triton_mm_736 0.0069 ms 83.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9117629Z SingleProcess AUTOTUNE benchmarking takes 0.2439 seconds and 0.2641 seconds precompiling for 36 choices 2025-12-04T10:04:38.9117935Z Compiled module path: /tmp/tmpe5bqwpig/y7/cy7ub6yjz5s4g4mpdgpjzvuqb7gphmpwcjw42iyms3paha46pj5p.py 2025-12-04T10:04:38.9118276Z Compiled module path: /tmp/tmpe5bqwpig/mr/cmr2bo6u3tcymggebl67fwvvob3ozo7giumu4noxgl3ysc3vpiw2.py 2025-12-04T10:04:38.9118531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9118695Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9118827Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9119026Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), 
('ok', 2)] 2025-12-04T10:04:38.9119803Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9120490Z graph_break [] 2025-12-04T10:04:38.9120601Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9120764Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9120925Z Autotune Choices Stats: 2025-12-04T10:04:38.9121369Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.00675999978557229, "best_triton_pos": 0} 2025-12-04T10:04:38.9121821Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9121939Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9122061Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9122386Z triton_mm_797 0.0068 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9122921Z triton_mm_809 0.0069 ms 97.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9123446Z triton_mm_799 0.0070 ms 
97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9123959Z triton_mm_800 0.0070 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9124468Z triton_mm_811 0.0070 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9124978Z triton_mm_798 0.0070 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9125490Z triton_mm_801 0.0070 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9126045Z triton_mm_804 0.0070 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9126557Z triton_mm_810 0.0070 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9127079Z triton_mm_803 0.0071 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=4 2025-12-04T10:04:38.9127485Z SingleProcess AUTOTUNE benchmarking takes 0.2919 seconds and 0.2769 seconds precompiling for 36 choices 2025-12-04T10:04:38.9127810Z Compiled module path: /tmp/tmpc3x23rdt/m2/cm2ipkuwqxr673pwyvpe6owo4f7ce4yxjun54kadsjfrfogc7g6s.py 2025-12-04T10:04:38.9128127Z Compiled module path: /tmp/tmpc3x23rdt/hp/chpxnrjaldzng6pgdpmn7a7wy5uzwzbav7qib3bcar4i4qg3oirs.py 2025-12-04T10:04:38.9128380Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9128539Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9128675Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9128876Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9129638Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9130329Z graph_break [] 2025-12-04T10:04:38.9130435Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9133744Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9133905Z Autotune Choices Stats: 2025-12-04T10:04:38.9134388Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_875", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.0060800001956522465, 
"best_triton_pos": 0} 2025-12-04T10:04:38.9134855Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9134966Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9135080Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9135403Z triton_mm_875 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9135966Z triton_mm_876 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9136483Z triton_mm_873 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9136992Z triton_mm_872 0.0064 ms 94.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9137501Z triton_mm_882 0.0066 ms 92.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9138004Z triton_mm_870 0.0068 ms 89.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9138543Z triton_mm_877 0.0068 ms 88.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=8 2025-12-04T10:04:38.9139051Z triton_mm_880 0.0069 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9139571Z triton_mm_871 0.0071 ms 85.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9140075Z triton_mm_879 0.0074 ms 81.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9140480Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.2699 seconds precompiling for 36 choices 2025-12-04T10:04:38.9140786Z Compiled module path: /tmp/tmper0dfq1d/fr/cfrobu2wovn5222bbaezhuk265vqm6f36xnhuwqjeilejfmtgznq.py 2025-12-04T10:04:38.9141097Z Compiled module path: /tmp/tmper0dfq1d/yw/cywffim3vt25qubrullqtfqomziocax3fhnxcfyvl4m6x6yal42y.py 2025-12-04T10:04:38.9141348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9141504Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9141637Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9141830Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9142596Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), 
('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9143295Z graph_break [] 2025-12-04T10:04:38.9143401Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9143561Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9143713Z Autotune Choices Stats: 2025-12-04T10:04:38.9144148Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_940", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9144592Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9144703Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9144819Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9145138Z triton_mm_940 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9145644Z triton_mm_944 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9146188Z triton_mm_948 0.0063 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9146696Z triton_mm_941 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9147225Z triton_mm_943 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9147743Z triton_mm_955 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9148248Z triton_mm_954 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9148757Z triton_mm_952 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9149261Z triton_mm_939 0.0069 ms 86.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9149765Z triton_mm_945 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9150169Z SingleProcess AUTOTUNE benchmarking takes 0.2269 seconds and 0.2717 seconds precompiling for 36 choices 2025-12-04T10:04:38.9150473Z Compiled module path: /tmp/tmpwkvnnect/gn/cgnakjn7yvf6krt3othkmflve5cqva5bqk73ozkigagbztvfbdoc.py 2025-12-04T10:04:38.9150803Z Compiled module path: /tmp/tmpwkvnnect/sd/csdsjkyn4kyspkjy7tiggdg7xuhq42zoaok6vbgv57teck2rwmam.py 
2025-12-04T10:04:38.9151053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9151227Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9151357Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9151549Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9152298Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9152979Z graph_break [] 2025-12-04T10:04:38.9153084Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9153242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9153395Z Autotune Choices Stats: 2025-12-04T10:04:38.9153840Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1021", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006120000034570694, "best_triton_pos": 0} 2025-12-04T10:04:38.9154288Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9154403Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9154518Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9154847Z triton_mm_1021 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9155373Z triton_mm_1015 0.0066 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9155882Z triton_mm_1013 0.0068 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9156459Z triton_mm_1014 0.0070 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9156971Z triton_mm_1017 0.0070 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9157475Z triton_mm_1019 0.0070 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9157978Z triton_mm_1016 0.0070 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9158488Z triton_mm_1020 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9159020Z triton_mm_1026 0.0071 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9159543Z triton_mm_1027 0.0071 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9159944Z SingleProcess AUTOTUNE benchmarking takes 0.2641 seconds and 0.2670 seconds precompiling for 36 choices 2025-12-04T10:04:38.9160251Z Compiled module path: /tmp/tmp81auodu0/bf/cbfiabwhqwjdcvgdirm7kyd4sj5apess24bea7fggnru7txk2xcw.py 2025-12-04T10:04:38.9160567Z Compiled module path: /tmp/tmp81auodu0/yh/cyhryggrrupolf4lgrp55fgb4n56nsechxny4zgss2arxjxhs3rf.py 2025-12-04T10:04:38.9160816Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9160972Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9161104Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9161298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9162084Z inductor [('triton_bundler_save_kernel', 536), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 47), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 32), ('coordesc_tuning_bench', 18), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1)] 2025-12-04T10:04:38.9162802Z graph_break [] 2025-12-04T10:04:38.9162904Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9163059Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:04:38.9163229Z Autotune Choices Stats: 2025-12-04T10:04:38.9163666Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1086", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T10:04:38.9164110Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9164221Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9164333Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9164668Z triton_mm_1086 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9165177Z triton_mm_1087 0.0067 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9165690Z triton_mm_1092 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9166237Z triton_mm_1093 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9166750Z triton_mm_1101 0.0073 ms 80.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9167291Z triton_mm_1089 0.0074 ms 79.9% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9167806Z triton_mm_1100 0.0074 ms 79.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9168333Z triton_mm_1091 0.0074 ms 79.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9168839Z triton_mm_1095 0.0075 ms 78.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9169345Z triton_mm_1088 0.0076 ms 77.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9169747Z SingleProcess AUTOTUNE benchmarking takes 0.2702 seconds and 0.2603 seconds precompiling for 36 choices 2025-12-04T10:04:38.9170055Z Compiled module path: /tmp/tmpral48sit/yp/cyphy65s3ey2wzxuwp6oyu4nli5eryjj2qd5hgow2azsqtmxhhs5.py 2025-12-04T10:04:38.9170369Z Compiled module path: /tmp/tmpral48sit/i3/ci3qcrdhjvu5i3l3dp7pwfj6mipve46faxjqebmeqbk4ngg2tpo4.py 2025-12-04T10:04:38.9170654Z ______ BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code _______ 2025-12-04T10:04:38.9170850Z Traceback (most recent call last): 2025-12-04T10:04:38.9171090Z File "/var/lib/jenkins/pytorch/test/inductor/test_benchmark_fusion.py", line 303, in test_equivalent_template_code 2025-12-04T10:04:38.9171359Z ).check("" if 
config.cpp_wrapper else "return").run(out_code[0]) 2025-12-04T10:04:38.9171573Z RuntimeError: Expected to find "triton_tem_fused_addmm_relu_t_0" but did not find it 2025-12-04T10:04:38.9171768Z Searched string: 2025-12-04T10:04:38.9171875Z with torch.cuda._DeviceGuard(0): 2025-12-04T10:04:38.9172007Z torch.cuda.set_device(0) 2025-12-04T10:04:38.9172160Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float16) 2025-12-04T10:04:38.9172367Z # Topologically Sorted Source Nodes: [a], Original ATen: [aten.t, aten.addmm] 2025-12-04T10:04:38.9172544Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9172713Z triton_tem_fused_addmm_t_0.run(arg2_1, arg0_1, buf0, 32, 1, 1, stream=stream0) 2025-12-04T10:04:38.9172892Z del arg0_1 2025-12-04T10:04:38.9172993Z del arg2_1 2025-12-04T10:04:38.9173102Z buf1 = buf0; del buf0 # reuse 2025-12-04T10:04:38.9173285Z # Topologically Sorted Source Nodes: [a, relu], Original ATen: [aten.addmm, aten.relu] 2025-12-04T10:04:38.9173465Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9173627Z triton_poi_fused_addmm_relu_1.run(buf1, arg1_1, 65536, stream=stream0) 2025-12-04T10:04:38.9173787Z del arg1_1 2025-12-04T10:04:38.9173887Z return (buf1, ) 2025-12-04T10:04:38.9173951Z 2025-12-04T10:04:38.9174001Z runner = Runner(partitions=[]) 2025-12-04T10:04:38.9174114Z call = runner.call 2025-12-04T10:04:38.9174244Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T10:04:38.9174346Z 2025-12-04T10:04:38.9174347Z 2025-12-04T10:04:38.9174410Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T10:04:38.9174560Z from torch._dynamo.testing import rand_strided 2025-12-04T10:04:38.9174718Z from torch._inductor.utils import print_performance 2025-12-04T10:04:38.9174903Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9175106Z arg1_1 = rand_strided((256, ), (1, ), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9175302Z arg2_1 = rand_strided((256, 256), (256, 1), 
device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9175487Z fn = lambda: call([arg0_1, arg1_1, arg2_1]) 2025-12-04T10:04:38.9175643Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T10:04:38.9175772Z 2025-12-04T10:04:38.9175774Z 2025-12-04T10:04:38.9175818Z if __name__ == "__main__": 2025-12-04T10:04:38.9176008Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T10:04:38.9176195Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T10:04:38.9176351Z From CHECK: triton_tem_fused_addmm_relu_t_0 2025-12-04T10:04:38.9176437Z 2025-12-04T10:04:38.9176439Z 2025-12-04T10:04:38.9176514Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:38.9176815Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code 2025-12-04T10:04:38.9177041Z 2025-12-04T10:04:38.9177129Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:38.9177334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9177491Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9177622Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9177816Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9178570Z inductor [('triton_bundler_save_kernel', 448), ('benchmarking.InductorBenchmarker.benchmark_gpu', 52), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 16), ('coordesc_tuning_bench', 7), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9179253Z graph_break 
[] 2025-12-04T10:04:38.9179374Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9179534Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9180138Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:04:38.9180699Z current_size = base.storage().size() 2025-12-04T10:04:38.9180834Z Autotune Choices Stats: 2025-12-04T10:04:38.9181271Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_7", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9181714Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9181825Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9181941Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9182260Z triton_mm_7 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9182766Z triton_mm_11 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9183264Z triton_mm_8 0.0063 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9183771Z triton_mm_6 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9184286Z triton_mm_12 0.0067 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9184790Z triton_mm_5 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9185289Z triton_mm_9 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9185793Z triton_mm_22 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9186079Z triton_mm_13 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9186312Z triton_mm_19 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9186442Z SingleProcess AUTOTUNE benchmarking takes 0.2508 seconds and 0.3321 seconds precompiling for 36 
choices 2025-12-04T10:04:38.9186585Z Compiled module path: /tmp/tmphbgk0pon/w4/cw4f4q7u3v7sk4jykcksc6x23v3vsehvxi7p7oiyzgltxtm22ko6.py 2025-12-04T10:04:38.9186748Z Compiled module path: /tmp/tmphbgk0pon/bf/cbfsx2nqaiielw4hd7yslddciqzngmnd2m2d4ae552nnm7wiy5f5.py 2025-12-04T10:04:38.9186826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9186870Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9186931Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9187032Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9187699Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 73), ('async_compile_cache_miss', 47), ('benchmarking.InductorBenchmarker.benchmark', 37), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('coordesc_tuning_bench', 23), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1)] 2025-12-04T10:04:38.9187738Z graph_break [] 2025-12-04T10:04:38.9187787Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9187863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9187905Z Autotune Choices Stats: 2025-12-04T10:04:38.9188280Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_77", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T10:04:38.9188323Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9188364Z strides: 
[256, 1], [1, 256] 2025-12-04T10:04:38.9188411Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9188666Z triton_mm_77 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9188912Z triton_mm_84 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9189145Z triton_mm_88 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9189376Z triton_mm_81 0.0069 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9189608Z triton_mm_79 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9189838Z triton_mm_83 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9190071Z triton_mm_85 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9190303Z triton_mm_91 0.0070 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9190533Z triton_mm_80 0.0071 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9190898Z triton_mm_90 0.0071 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9191029Z SingleProcess AUTOTUNE benchmarking takes 0.2675 seconds and 0.2612 seconds precompiling for 36 choices 2025-12-04T10:04:38.9191177Z Compiled module path: /tmp/tmple4d_89u/ba/cbago5xr5v2secbwecligdrfh4zzzsojjri3gfceyn2lclaua3wz.py 2025-12-04T10:04:38.9191310Z Compiled module path: /tmp/tmple4d_89u/bw/cbw6qbjuvv6m3ss7vtktie22ioyr2mwuvshkuutjpfcz3uoyzrcj.py 2025-12-04T10:04:38.9191385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9191428Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9191487Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9191591Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9192216Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 
1)] 2025-12-04T10:04:38.9192254Z graph_break [] 2025-12-04T10:04:38.9192300Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9192374Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9192415Z Autotune Choices Stats: 2025-12-04T10:04:38.9192801Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_156", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.00595899997279048, "best_triton_pos": 0} 2025-12-04T10:04:38.9192857Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9192897Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9192944Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9193186Z triton_mm_156 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9193420Z triton_mm_157 0.0060 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9193654Z triton_mm_150 0.0065 ms 91.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9193891Z triton_mm_162 0.0066 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9194122Z triton_mm_148 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9194355Z triton_mm_152 0.0069 ms 86.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9194598Z triton_mm_146 0.0069 ms 86.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9194833Z triton_mm_163 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9195080Z triton_mm_166 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9195314Z triton_mm_160 0.0071 ms 83.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9195446Z SingleProcess AUTOTUNE benchmarking takes 0.2364 seconds and 0.2725 seconds precompiling for 36 choices 2025-12-04T10:04:38.9195586Z Compiled module path: /tmp/tmpujpayw3u/6r/c6rppdseuymkfljpvakni5ryayiqjhbdwwu4ki6zlrr63hvc72ie.py 2025-12-04T10:04:38.9195727Z Compiled module path: /tmp/tmpujpayw3u/ij/cijgyswvhqcajl4jrzaivb4upptrtyrigdkzwa5u5gmh64bpx2sa.py 2025-12-04T10:04:38.9195800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9195844Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:04:38.9195901Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9196046Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9196683Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9196736Z graph_break [] 2025-12-04T10:04:38.9196783Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9196858Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9196900Z Autotune Choices Stats: 2025-12-04T10:04:38.9197271Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_221", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.006440000142902136, "best_triton_pos": 0} 2025-12-04T10:04:38.9197316Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9197355Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9197403Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9197639Z triton_mm_221 0.0064 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9197877Z triton_mm_228 0.0065 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9198111Z triton_mm_227 0.0066 ms 98.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9198371Z triton_mm_223 0.0066 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9198606Z triton_mm_217 0.0068 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=32, BLOCK_N=16, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2 2025-12-04T10:04:38.9198856Z triton_mm_219 0.0068 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9199086Z triton_mm_222 0.0068 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9199319Z triton_mm_229 0.0069 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9199551Z triton_mm_224 0.0070 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9199782Z triton_mm_220 
0.0070 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9199914Z SingleProcess AUTOTUNE benchmarking takes 0.2322 seconds and 0.2645 seconds precompiling for 36 choices 2025-12-04T10:04:38.9200055Z Compiled module path: /tmp/tmplqztcoos/c3/cc3vogulhiuqlbae6kejcpo5jtkrdbi67rnpmmy4y36lsooku3k7.py 2025-12-04T10:04:38.9200207Z Compiled module path: /tmp/tmplqztcoos/m3/cm3qvnsjgpjjt5hgtdu5whbh22cnjz2xnkuhttjsedptbmlzsnci.py 2025-12-04T10:04:38.9200293Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9200336Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9200394Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9200493Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9201109Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9201147Z graph_break [] 2025-12-04T10:04:38.9201196Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9201270Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9201312Z Autotune Choices Stats: 2025-12-04T10:04:38.9201681Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_296", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006399999838322401, "best_triton_pos": 0} 2025-12-04T10:04:38.9201724Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9201764Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9201809Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9202049Z triton_mm_296 0.0064 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9202292Z triton_mm_295 0.0065 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9202536Z triton_mm_297 0.0066 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9202766Z triton_mm_299 0.0066 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9203000Z triton_mm_301 0.0066 ms 96.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9203234Z triton_mm_306 0.0067 ms 95.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9203467Z triton_mm_293 
0.0068 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9203701Z triton_mm_305 0.0069 ms 93.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9203944Z triton_mm_294 0.0069 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9204189Z triton_mm_309 0.0072 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9204318Z SingleProcess AUTOTUNE benchmarking takes 0.2584 seconds and 0.2697 seconds precompiling for 36 choices 2025-12-04T10:04:38.9204457Z Compiled module path: /tmp/tmpdomlhjox/mc/cmcwujd5cizd4scnlcqifabifo5v5umjcgkbbpca7sn7rh76b2pf.py 2025-12-04T10:04:38.9204593Z Compiled module path: /tmp/tmpdomlhjox/b7/cb7ygmw7hpgnsj5eivnjg2ahmxqkaxd2nxdhlybrabuzi3gablpe.py 2025-12-04T10:04:38.9204667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9204710Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9204769Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9204868Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9205483Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 
36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9205523Z graph_break [] 2025-12-04T10:04:38.9205570Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9205645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9205697Z Autotune Choices Stats: 2025-12-04T10:04:38.9206116Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_369", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.00583899999037385, "best_triton_pos": 0} 2025-12-04T10:04:38.9206159Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9206199Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9206245Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9206502Z triton_mm_369 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9206735Z triton_mm_366 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9206971Z triton_mm_379 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9207208Z triton_mm_367 0.0064 
ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9207442Z triton_mm_373 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9207688Z triton_mm_368 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9207921Z triton_mm_372 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9208175Z triton_mm_378 0.0069 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9208412Z triton_mm_382 0.0070 ms 83.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9208648Z triton_mm_377 0.0071 ms 82.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9208780Z SingleProcess AUTOTUNE benchmarking takes 0.2535 seconds and 0.2696 seconds precompiling for 36 choices 2025-12-04T10:04:38.9208920Z Compiled module path: 
/tmp/tmprfi6eta4/jn/cjnxgjerka6xvgwxgssgtk4pjsfrtpl4q2zoeknjdhbqu4r7i3w2.py 2025-12-04T10:04:38.9209056Z Compiled module path: /tmp/tmprfi6eta4/jq/cjqcixbt64by5nhzholfgtbeoclbevma4msrl43l63jnlycw5kdc.py 2025-12-04T10:04:38.9209128Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9209173Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9209230Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9209331Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9209947Z inductor [('triton_bundler_save_kernel', 456), ('benchmarking.InductorBenchmarker.benchmark_gpu', 53), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 17), ('coordesc_tuning_bench', 8), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9210006Z graph_break [] 2025-12-04T10:04:38.9210052Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9210143Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9210185Z Autotune Choices Stats: 2025-12-04T10:04:38.9210557Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_439", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.0060800001956522465, "best_triton_pos": 0} 2025-12-04T10:04:38.9210605Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9210644Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9210692Z dtypes: torch.float16, torch.float16 
2025-12-04T10:04:38.9210925Z triton_mm_439 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9211158Z triton_mm_437 0.0062 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9211391Z triton_mm_448 0.0069 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9211652Z triton_mm_449 0.0069 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9211895Z triton_mm_441 0.0070 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9212128Z triton_mm_444 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9212358Z triton_mm_438 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9212591Z triton_mm_440 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9212822Z triton_mm_443 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9213055Z triton_mm_450 0.0071 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9213185Z SingleProcess AUTOTUNE benchmarking takes 0.2656 seconds and 0.2557 seconds precompiling for 36 choices 2025-12-04T10:04:38.9213319Z Compiled module path: /tmp/tmpiy_c81rd/4r/c4rkjb26sea3bukicooquz5gpwvhhyzs4ilgnntffmfie5m4e7vj.py 2025-12-04T10:04:38.9213465Z Compiled module path: /tmp/tmpiy_c81rd/7h/c7h5slbwz4bgun22psqksryrnky263lpgiwdpwilovbiegbv57da.py 2025-12-04T10:04:38.9213540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9213581Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9213638Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9213736Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9214366Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9214405Z graph_break [] 2025-12-04T10:04:38.9214453Z 
aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9214529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9214570Z Autotune Choices Stats: 2025-12-04T10:04:38.9214938Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_511", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.006519999820739031, "best_triton_pos": 0} 2025-12-04T10:04:38.9214981Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9215021Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9215068Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9215312Z triton_mm_511 0.0065 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9215555Z triton_mm_520 0.0069 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9215787Z triton_mm_512 0.0070 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9216057Z triton_mm_510 0.0070 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9216290Z triton_mm_508 0.0076 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9216524Z triton_mm_513 0.0076 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9216758Z triton_mm_519 0.0076 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9216988Z triton_mm_517 0.0077 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9217243Z triton_mm_525 0.0077 ms 84.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9217477Z triton_mm_523 0.0078 ms 83.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9217607Z SingleProcess AUTOTUNE benchmarking takes 0.6616 seconds and 0.2698 seconds precompiling for 36 choices 2025-12-04T10:04:38.9217757Z Compiled module path: /tmp/tmpxt1vai9i/5w/c5wwu32cp233qqzv5dpouzaq4su75rkycnetna6tch46zq52fijk.py 2025-12-04T10:04:38.9217892Z Compiled module path: /tmp/tmpxt1vai9i/ul/cul5r4u6mig6mua5yzuigwwlhqkaevnlnnoz4omja5z45ojvaizf.py 2025-12-04T10:04:38.9217968Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9218011Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9218071Z stats [('calls_captured', 4), ('unique_graphs', 
2)] 2025-12-04T10:04:38.9218170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9218791Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9218830Z graph_break [] 2025-12-04T10:04:38.9218875Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9218949Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9218990Z Autotune Choices Stats: 2025-12-04T10:04:38.9219374Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_595", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9219430Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9219470Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9219515Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9219755Z triton_mm_595 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9219994Z triton_mm_585 0.0062 ms 95.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9220232Z triton_mm_587 0.0064 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9220467Z triton_mm_584 0.0065 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9220700Z triton_mm_582 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9220935Z triton_mm_589 0.0066 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9221177Z triton_mm_594 0.0067 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9221425Z triton_mm_592 0.0068 ms 87.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9221658Z triton_mm_593 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9221892Z triton_mm_598 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9222023Z SingleProcess AUTOTUNE benchmarking takes 0.2574 seconds and 0.2680 seconds precompiling for 36 choices 2025-12-04T10:04:38.9222161Z Compiled module path: /tmp/tmp6audcjbe/zk/czkdqv2xbklr4p4dpjneas4fzxmjzd6odn7aavh4nwxtaztq6ock.py 2025-12-04T10:04:38.9222297Z Compiled module path: /tmp/tmp6audcjbe/kl/cklltvqedtgj6pyk5aaedy3c774zevfwxq7oxy6o53pxikluowui.py 2025-12-04T10:04:38.9222371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9222414Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9222470Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9222570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9223196Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9223247Z graph_break [] 2025-12-04T10:04:38.9223294Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9223369Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9223408Z Autotune Choices Stats: 2025-12-04T10:04:38.9223778Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_660", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T10:04:38.9223822Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9223863Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9223910Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9224153Z triton_mm_660 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9224389Z triton_mm_655 0.0061 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9224622Z triton_mm_661 0.0061 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9224873Z triton_mm_656 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9225105Z triton_mm_653 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9225349Z triton_mm_657 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9225579Z triton_mm_654 0.0066 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9225810Z triton_mm_659 0.0066 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9226091Z triton_mm_666 0.0066 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9226326Z triton_mm_667 0.0067 ms 88.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9226456Z SingleProcess AUTOTUNE benchmarking takes 0.2302 seconds and 0.2735 seconds precompiling for 36 choices 2025-12-04T10:04:38.9226587Z Compiled module path: /tmp/tmpql_vaqg4/h3/ch32lgymd5v7nxzv4j63jmv24mbl5ftg76yvqmuxyxoa3aquh4fm.py 2025-12-04T10:04:38.9226739Z Compiled module path: /tmp/tmpql_vaqg4/6d/c6dwrbsr74ysmtciajexo5cs5fl7tyufrrgp5merbbu3ljlmjcnq.py 2025-12-04T10:04:38.9226828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9226871Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9226928Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9227029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9227645Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), 
('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9227683Z graph_break [] 2025-12-04T10:04:38.9227730Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9227803Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9227844Z Autotune Choices Stats: 2025-12-04T10:04:38.9228215Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_728", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T10:04:38.9228260Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9228299Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9228345Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9228582Z triton_mm_728 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9228832Z triton_mm_729 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9229079Z triton_mm_731 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9229309Z triton_mm_724 0.0064 ms 90.6% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9229543Z triton_mm_739 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9229775Z triton_mm_738 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9230007Z triton_mm_725 0.0068 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9230244Z triton_mm_742 0.0068 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9230488Z triton_mm_727 0.0068 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9230731Z triton_mm_736 0.0069 ms 83.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9230861Z SingleProcess AUTOTUNE benchmarking takes 0.2439 seconds and 0.2641 seconds precompiling for 36 choices 2025-12-04T10:04:38.9230999Z Compiled module path: /tmp/tmpe5bqwpig/y7/cy7ub6yjz5s4g4mpdgpjzvuqb7gphmpwcjw42iyms3paha46pj5p.py 
2025-12-04T10:04:38.9231133Z Compiled module path: /tmp/tmpe5bqwpig/mr/cmr2bo6u3tcymggebl67fwvvob3ozo7giumu4noxgl3ysc3vpiw2.py 2025-12-04T10:04:38.9231207Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9231249Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9231307Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9231406Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9232025Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9232061Z graph_break [] 2025-12-04T10:04:38.9232109Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9232185Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9232241Z Autotune Choices Stats: 2025-12-04T10:04:38.9232613Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_797", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.00675999978557229, "best_triton_pos": 0} 2025-12-04T10:04:38.9232655Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9232697Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9232753Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9232992Z triton_mm_797 0.0068 ms 100.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9233226Z triton_mm_809 0.0069 ms 97.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9233459Z triton_mm_799 0.0070 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9233688Z triton_mm_800 0.0070 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9233921Z triton_mm_811 0.0070 ms 97.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9234162Z triton_mm_798 0.0070 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9234409Z triton_mm_801 0.0070 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9234642Z triton_mm_804 0.0070 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 
2025-12-04T10:04:38.9234875Z triton_mm_810 0.0070 ms 96.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9235108Z triton_mm_803 0.0071 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9235238Z SingleProcess AUTOTUNE benchmarking takes 0.2919 seconds and 0.2769 seconds precompiling for 36 choices 2025-12-04T10:04:38.9235375Z Compiled module path: /tmp/tmpc3x23rdt/m2/cm2ipkuwqxr673pwyvpe6owo4f7ce4yxjun54kadsjfrfogc7g6s.py 2025-12-04T10:04:38.9235515Z Compiled module path: /tmp/tmpc3x23rdt/hp/chpxnrjaldzng6pgdpmn7a7wy5uzwzbav7qib3bcar4i4qg3oirs.py 2025-12-04T10:04:38.9235589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9235633Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9235691Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9235793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9236486Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9236542Z graph_break [] 2025-12-04T10:04:38.9236588Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9236678Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9236718Z Autotune Choices Stats: 2025-12-04T10:04:38.9237093Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_875", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.0060800001956522465, "best_triton_pos": 0} 2025-12-04T10:04:38.9237136Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9237177Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9237223Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9237458Z triton_mm_875 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9237695Z triton_mm_876 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9237925Z triton_mm_873 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9238174Z triton_mm_872 0.0064 ms 94.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9238430Z triton_mm_882 0.0066 ms 92.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 
2025-12-04T10:04:38.9238663Z triton_mm_870 0.0068 ms 89.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9238896Z triton_mm_877 0.0068 ms 88.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9239132Z triton_mm_880 0.0069 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9239366Z triton_mm_871 0.0071 ms 85.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9239598Z triton_mm_879 0.0074 ms 81.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9239729Z SingleProcess AUTOTUNE benchmarking takes 0.2643 seconds and 0.2699 seconds precompiling for 36 choices 2025-12-04T10:04:38.9239877Z Compiled module path: /tmp/tmper0dfq1d/fr/cfrobu2wovn5222bbaezhuk265vqm6f36xnhuwqjeilejfmtgznq.py 2025-12-04T10:04:38.9240014Z Compiled module path: /tmp/tmper0dfq1d/yw/cywffim3vt25qubrullqtfqomziocax3fhnxcfyvl4m6x6yal42y.py 2025-12-04T10:04:38.9240089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9240133Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9240189Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9240289Z aot_autograd [('total', 2), 
('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9240921Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9240964Z graph_break [] 2025-12-04T10:04:38.9241010Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9241083Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9241126Z Autotune Choices Stats: 2025-12-04T10:04:38.9241495Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_940", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9241538Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9241577Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9241625Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9241873Z triton_mm_940 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9242117Z triton_mm_944 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 
2025-12-04T10:04:38.9242354Z triton_mm_948 0.0063 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9242592Z triton_mm_941 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9242827Z triton_mm_943 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9243059Z triton_mm_955 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9243294Z triton_mm_954 0.0066 ms 90.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9243527Z triton_mm_952 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9243772Z triton_mm_939 0.0069 ms 86.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9244003Z triton_mm_945 0.0070 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, 
kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9244146Z SingleProcess AUTOTUNE benchmarking takes 0.2269 seconds and 0.2717 seconds precompiling for 36 choices 2025-12-04T10:04:38.9244285Z Compiled module path: /tmp/tmpwkvnnect/gn/cgnakjn7yvf6krt3othkmflve5cqva5bqk73ozkigagbztvfbdoc.py 2025-12-04T10:04:38.9244421Z Compiled module path: /tmp/tmpwkvnnect/sd/csdsjkyn4kyspkjy7tiggdg7xuhq42zoaok6vbgv57teck2rwmam.py 2025-12-04T10:04:38.9244495Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9244539Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9244601Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9244701Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9245318Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9245356Z graph_break [] 2025-12-04T10:04:38.9245404Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9245478Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9245537Z Autotune Choices Stats: 2025-12-04T10:04:38.9245909Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1021", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=8", "best_time": 0.006120000034570694, "best_triton_pos": 0} 2025-12-04T10:04:38.9246004Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9246045Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9246094Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9246335Z triton_mm_1021 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9246570Z triton_mm_1015 0.0066 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9246809Z triton_mm_1013 0.0068 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9247042Z triton_mm_1014 0.0070 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9247277Z triton_mm_1017 0.0070 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9247510Z triton_mm_1019 0.0070 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9247764Z triton_mm_1016 0.0070 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9248016Z triton_mm_1020 0.0070 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9248250Z triton_mm_1026 0.0071 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9248489Z triton_mm_1027 0.0071 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9248620Z SingleProcess AUTOTUNE benchmarking takes 0.2641 seconds and 0.2670 seconds precompiling for 36 choices 2025-12-04T10:04:38.9248762Z Compiled module path: /tmp/tmp81auodu0/bf/cbfiabwhqwjdcvgdirm7kyd4sj5apess24bea7fggnru7txk2xcw.py 2025-12-04T10:04:38.9248898Z Compiled module path: /tmp/tmp81auodu0/yh/cyhryggrrupolf4lgrp55fgb4n56nsechxny4zgss2arxjxhs3rf.py 2025-12-04T10:04:38.9248976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9249018Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9249079Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9249178Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9249848Z inductor [('triton_bundler_save_kernel', 536), ('benchmarking.InductorBenchmarker.benchmark_gpu', 68), ('async_compile_cache_miss', 47), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 32), 
('coordesc_tuning_bench', 18), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1)] 2025-12-04T10:04:38.9249907Z graph_break [] 2025-12-04T10:04:38.9249955Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9250031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9250072Z Autotune Choices Stats: 2025-12-04T10:04:38.9250443Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1086", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T10:04:38.9250488Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9250529Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9250575Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9250815Z triton_mm_1086 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9251050Z triton_mm_1087 0.0067 ms 87.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9251286Z triton_mm_1092 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9251542Z triton_mm_1093 0.0068 ms 86.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9251788Z triton_mm_1101 0.0073 ms 80.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9252024Z triton_mm_1089 0.0074 ms 79.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9252259Z triton_mm_1100 0.0074 ms 79.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9252493Z triton_mm_1091 0.0074 ms 79.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9252732Z triton_mm_1095 0.0075 ms 78.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9252965Z triton_mm_1088 0.0076 ms 77.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9253096Z SingleProcess AUTOTUNE benchmarking takes 0.2702 seconds and 0.2603 seconds precompiling for 36 choices 2025-12-04T10:04:38.9253254Z Compiled module path: /tmp/tmpral48sit/yp/cyphy65s3ey2wzxuwp6oyu4nli5eryjj2qd5hgow2azsqtmxhhs5.py 
2025-12-04T10:04:38.9253407Z Compiled module path: /tmp/tmpral48sit/i3/ci3qcrdhjvu5i3l3dp7pwfj6mipve46faxjqebmeqbk4ngg2tpo4.py 2025-12-04T10:04:38.9253481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9253525Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9253581Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9253683Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9254301Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9254343Z graph_break [] 2025-12-04T10:04:38.9254390Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9254467Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9254507Z Autotune Choices Stats: 2025-12-04T10:04:38.9254881Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1160", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005679000169038773, "best_triton_pos": 0} 2025-12-04T10:04:38.9254927Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9254966Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9255026Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9255265Z triton_mm_1160 0.0057 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9255499Z triton_mm_1156 0.0058 ms 97.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9255742Z triton_mm_1159 0.0059 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9256027Z triton_mm_1171 0.0064 ms 89.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9256258Z triton_mm_1163 0.0065 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9256490Z triton_mm_1158 0.0066 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9256726Z triton_mm_1168 0.0068 ms 83.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9256983Z triton_mm_1170 0.0072 ms 78.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=4 2025-12-04T10:04:38.9257220Z triton_mm_1166 0.0073 ms 77.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9257466Z triton_mm_1165 0.0073 ms 77.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9257599Z SingleProcess AUTOTUNE benchmarking takes 0.2492 seconds and 0.2609 seconds precompiling for 36 choices 2025-12-04T10:04:38.9257739Z Compiled module path: /tmp/tmpylydix3m/wk/cwkqk3jqrtp7p5ywfw7yugvf4myg4cnqixfpvaio5noxrl2o2nuf.py 2025-12-04T10:04:38.9257876Z Compiled module path: /tmp/tmpylydix3m/rc/crckgdsx3jauj2f4a5stiyrm4gdsuedvebu5xx4jkjegsikajz5k.py 2025-12-04T10:04:38.9257954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9257997Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9258058Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9258158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9258774Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9258812Z graph_break [] 2025-12-04T10:04:38.9258858Z aten_mm_info [('aten.mm_256_256_256', 2)] 
2025-12-04T10:04:38.9258947Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9258989Z Autotune Choices Stats: 2025-12-04T10:04:38.9259362Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T10:04:38.9259405Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9259458Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9259509Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9259749Z triton_mm_1237 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9259984Z triton_mm_1230 0.0060 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9260218Z triton_mm_1232 0.0062 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9260449Z triton_mm_1231 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9260683Z triton_mm_1236 0.0065 ms 91.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9260930Z triton_mm_1233 0.0065 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9261173Z triton_mm_1235 0.0065 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9261411Z triton_mm_1241 0.0070 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9261646Z triton_mm_1243 0.0073 ms 80.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9261884Z triton_mm_1229 0.0073 ms 80.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9262014Z SingleProcess AUTOTUNE benchmarking takes 0.2644 seconds and 0.2713 seconds precompiling for 36 choices 2025-12-04T10:04:38.9262151Z Compiled module path: /tmp/tmpzack_54y/gp/cgptgjepklb4feakmxcmabr6w6geszcnbzoste2p4g76frh2ioes.py 2025-12-04T10:04:38.9262284Z Compiled module path: /tmp/tmpzack_54y/n3/cn3ewdtwxkxnzcbeyiyqgr6gt4yijsflmzogsgeqtdv2g5rxbdmp.py 2025-12-04T10:04:38.9262360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9262402Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9262461Z stats [('calls_captured', 4), ('unique_graphs', 2)] 
2025-12-04T10:04:38.9262560Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9263194Z inductor [('triton_bundler_save_kernel', 480), ('benchmarking.InductorBenchmarker.benchmark_gpu', 56), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 20), ('coordesc_tuning_bench', 11), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9263244Z graph_break [] 2025-12-04T10:04:38.9263291Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9263368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9263412Z Autotune Choices Stats: 2025-12-04T10:04:38.9263787Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1304", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005799000151455402, "best_triton_pos": 0} 2025-12-04T10:04:38.9263831Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9263873Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9263919Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9264156Z triton_mm_1304 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9264390Z triton_mm_1303 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9264638Z triton_mm_1307 0.0059 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9264882Z triton_mm_1305 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9265121Z triton_mm_1301 0.0062 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9265359Z triton_mm_1315 0.0062 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9265593Z triton_mm_1309 0.0063 ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9265828Z triton_mm_1300 0.0066 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9266115Z triton_mm_1308 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9266352Z triton_mm_1314 0.0067 ms 86.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9266500Z SingleProcess AUTOTUNE benchmarking takes 0.2621 seconds and 0.2591 seconds precompiling for 36 choices 2025-12-04T10:04:38.9266636Z Compiled module path: /tmp/tmpkep9ws28/74/c7442npvpdsgghxuxg33bsoflzdrrus5r5c7xhzpmo7fwrkw7zqf.py 2025-12-04T10:04:38.9266774Z Compiled module path: /tmp/tmpkep9ws28/lq/clqx3oxw4wqtiwmpwvzplk7odj5bn6di7xwhc5vfsyy7as3agksx.py 2025-12-04T10:04:38.9266849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9266892Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9266949Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9267067Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9267681Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 57), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 21), ('coordesc_tuning_bench', 12), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9267722Z graph_break [] 2025-12-04T10:04:38.9267770Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9267848Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9267888Z Autotune Choices Stats: 2025-12-04T10:04:38.9268259Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1380", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T10:04:38.9268305Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9268347Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9268408Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9268647Z triton_mm_1380 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9268894Z triton_mm_1372 0.0059 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9269134Z triton_mm_1369 0.0061 ms 95.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=32, BLOCK_N=16, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2 2025-12-04T10:04:38.9269366Z triton_mm_1374 0.0061 ms 94.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9269597Z triton_mm_1376 0.0063 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9269834Z triton_mm_1377 0.0063 ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9270065Z triton_mm_1370 0.0065 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9270297Z triton_mm_1379 0.0066 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9270544Z triton_mm_1381 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9270792Z triton_mm_1373 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9270924Z SingleProcess AUTOTUNE benchmarking takes 0.2080 seconds and 0.2524 seconds precompiling for 36 choices 2025-12-04T10:04:38.9271058Z Compiled module path: /tmp/tmpo6vp4vin/3c/c3c6qmy5icyyg4mcfhrpbocvutvkb2gq722zqc6tj6wbyvbounwk.py 2025-12-04T10:04:38.9271193Z Compiled module path: /tmp/tmpo6vp4vin/3g/c3gh6uhwpad2cczoteic46lesdcaobdeohycb5pt2zyou633fngq.py 2025-12-04T10:04:38.9271269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9271314Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9271370Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9271470Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9272087Z inductor [('triton_bundler_save_kernel', 472), ('benchmarking.InductorBenchmarker.benchmark_gpu', 55), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), 
('benchmarking.InductorBenchmarker.benchmark', 19), ('coordesc_tuning_bench', 10), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9272125Z graph_break [] 2025-12-04T10:04:38.9272174Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9272260Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9272316Z Autotune Choices Stats: 2025-12-04T10:04:38.9272686Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1446", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005799999926239252, "best_triton_pos": 0} 2025-12-04T10:04:38.9272729Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9272768Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9272815Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9273054Z triton_mm_1446 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9273289Z triton_mm_1447 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9273522Z triton_mm_1448 0.0065 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9273756Z triton_mm_1452 0.0065 ms 89.0% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9273990Z triton_mm_1443 0.0066 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9274233Z triton_mm_1449 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9274466Z triton_mm_1451 0.0067 ms 86.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9274709Z triton_mm_1445 0.0068 ms 85.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9274946Z triton_mm_1444 0.0068 ms 84.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9275187Z triton_mm_1459 0.0072 ms 81.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9275318Z SingleProcess AUTOTUNE benchmarking takes 0.2515 seconds and 0.3216 seconds precompiling for 36 choices 2025-12-04T10:04:38.9275459Z Compiled module path: /tmp/tmpqtaewbca/va/cvauei2bko57ln27dpjexfnta5nabtnm5gskako62m6agbajq6fq.py 
2025-12-04T10:04:38.9275599Z Compiled module path: /tmp/tmpqtaewbca/au/caufaqigqwkrea5aemfddhnfe7fsdqhgma35vtmmxifwtdx4qtsi.py 2025-12-04T10:04:38.9275676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9275718Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9275776Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9275876Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9276558Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9276611Z graph_break [] 2025-12-04T10:04:38.9276660Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9276733Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9276775Z Autotune Choices Stats: 2025-12-04T10:04:38.9277148Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1525", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T10:04:38.9277191Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9277231Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9277277Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9277520Z triton_mm_1525 0.0060 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9277753Z triton_mm_1519 0.0061 ms 98.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9278003Z triton_mm_1531 0.0065 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9278240Z triton_mm_1524 0.0066 ms 91.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9278487Z triton_mm_1518 0.0067 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9278719Z triton_mm_1520 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9278952Z triton_mm_1523 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9279189Z triton_mm_1530 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=4 2025-12-04T10:04:38.9279426Z triton_mm_1529 0.0070 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9279664Z triton_mm_1534 0.0071 ms 84.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9279806Z SingleProcess AUTOTUNE benchmarking takes 0.2512 seconds and 0.3044 seconds precompiling for 36 choices 2025-12-04T10:04:38.9279952Z Compiled module path: /tmp/tmpls_sl6lu/ap/capilc3ui3mq4uzue2xgfblkxurd5cl7v3nd2xijuxxqllyobjhq.py 2025-12-04T10:04:38.9280086Z Compiled module path: /tmp/tmpls_sl6lu/4l/c4l5pggfupje3yiwopoejmfdvpben3np7pc3mhwsz3tlzovdsiiy.py 2025-12-04T10:04:38.9280159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9280203Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9280261Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9280362Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9280975Z inductor [('triton_bundler_save_kernel', 456), ('benchmarking.InductorBenchmarker.benchmark_gpu', 53), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 17), ('coordesc_tuning_bench', 8), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9281016Z graph_break [] 2025-12-04T10:04:38.9281062Z aten_mm_info [('aten.mm_256_256_256', 2)] 
2025-12-04T10:04:38.9281139Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9281179Z Autotune Choices Stats: 2025-12-04T10:04:38.9281554Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1592", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005760000087320805, "best_triton_pos": 0} 2025-12-04T10:04:38.9281615Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9281656Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9281702Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9281942Z triton_mm_1592 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9282188Z triton_mm_1589 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9282422Z triton_mm_1597 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9282659Z triton_mm_1596 0.0063 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9282894Z triton_mm_1593 0.0064 ms 90.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9283130Z triton_mm_1591 0.0067 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9283364Z triton_mm_1600 0.0067 ms 85.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9283606Z triton_mm_1595 0.0069 ms 83.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9283853Z triton_mm_1605 0.0069 ms 83.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9284090Z triton_mm_1601 0.0070 ms 82.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9284221Z SingleProcess AUTOTUNE benchmarking takes 0.2499 seconds and 0.2734 seconds precompiling for 36 choices 2025-12-04T10:04:38.9284357Z Compiled module path: /tmp/tmpdpneu7dg/of/cofjmiaxacogimv4sy3f2rodd4dgj3mecd32horr7lbgfm2wdvwm.py 2025-12-04T10:04:38.9284494Z Compiled module path: /tmp/tmpdpneu7dg/jx/cjx5em7ax7mwjeujwsy75deek6z77vjo4bsv5u7e45dawcdnbfci.py 2025-12-04T10:04:38.9284567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9284612Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9284669Z stats [('calls_captured', 4), ('unique_graphs', 2)] 
2025-12-04T10:04:38.9284769Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9285382Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9285438Z graph_break [] 2025-12-04T10:04:38.9285486Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9285559Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9285604Z Autotune Choices Stats: 2025-12-04T10:04:38.9286047Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1664", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T10:04:38.9286089Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9286128Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9286176Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9286414Z triton_mm_1664 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9286648Z triton_mm_1659 0.0059 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9286882Z triton_mm_1661 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9287118Z triton_mm_1669 0.0064 ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9287366Z triton_mm_1660 0.0064 ms 91.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9287599Z triton_mm_1663 0.0064 ms 91.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9287851Z triton_mm_1665 0.0065 ms 90.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9288084Z triton_mm_1667 0.0066 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9288321Z triton_mm_1674 0.0067 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9288557Z triton_mm_1672 0.0068 ms 86.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9288686Z SingleProcess AUTOTUNE benchmarking takes 0.2293 seconds and 0.3648 seconds precompiling for 36 choices 2025-12-04T10:04:38.9288823Z Compiled module path: /tmp/tmpp0ea_gon/so/csofcx6q2qbnh4mokcotc37zwupyi2vzajv4to4bbhr2lnizi2y5.py 2025-12-04T10:04:38.9288954Z Compiled module path: /tmp/tmpp0ea_gon/j6/cj6vxjt7z4twgm6z6llc6ch6xg5u5ywlkpupo6of4uqzqv37v5x3.py 2025-12-04T10:04:38.9289031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9289073Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9289132Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9289250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9289880Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9289918Z graph_break [] 2025-12-04T10:04:38.9289966Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9290039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9290082Z Autotune Choices Stats: 2025-12-04T10:04:38.9290454Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1731", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005640000104904175, "best_triton_pos": 0} 2025-12-04T10:04:38.9290501Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9290541Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9290587Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9290826Z triton_mm_1731 0.0056 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9291061Z triton_mm_1736 0.0061 ms 92.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9291308Z triton_mm_1733 0.0062 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9291550Z triton_mm_1735 0.0062 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9291788Z triton_mm_1747 0.0064 ms 88.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9292021Z triton_mm_1739 0.0066 ms 85.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9292255Z triton_mm_1732 0.0067 ms 83.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9292491Z triton_mm_1741 0.0068 ms 82.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9292725Z triton_mm_1746 0.0068 ms 82.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9292963Z triton_mm_1744 0.0068 ms 82.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9293107Z SingleProcess AUTOTUNE benchmarking takes 0.2270 seconds and 0.3114 seconds precompiling for 36 choices 2025-12-04T10:04:38.9293247Z Compiled module path: /tmp/tmp3mjjb8_q/lq/clqymn4un5v7xfsnddym5x7pvgw5qjgqu2zn43rzzpok7feyuo7w.py 2025-12-04T10:04:38.9293383Z Compiled module path: /tmp/tmp3mjjb8_q/pm/cpmjhhulai6k75zx32wwlshpd7tllih2gsafofcfc2qwaz7thszs.py 2025-12-04T10:04:38.9293459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9293502Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9293571Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9293672Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9294293Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), 
('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9294335Z graph_break [] 2025-12-04T10:04:38.9294381Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9294458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9294500Z Autotune Choices Stats: 2025-12-04T10:04:38.9294871Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1819", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005919999908655882, "best_triton_pos": 0} 2025-12-04T10:04:38.9294913Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9294965Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9295028Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9295273Z triton_mm_1819 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9295509Z triton_mm_1805 0.0062 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9295745Z triton_mm_1809 0.0062 ms 94.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9296032Z triton_mm_1806 0.0063 ms 94.3% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9296263Z triton_mm_1811 0.0064 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9296496Z triton_mm_1808 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9296726Z triton_mm_1807 0.0064 ms 91.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9296984Z triton_mm_1816 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9297224Z triton_mm_1818 0.0069 ms 86.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9297476Z triton_mm_1817 0.0070 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9297609Z SingleProcess AUTOTUNE benchmarking takes 0.2522 seconds and 0.3744 seconds precompiling for 36 choices 2025-12-04T10:04:38.9297746Z Compiled module path: /tmp/tmpwecqpcl_/og/cogm4y6pyfqh5qlhqzehsm7wtnpyjhoqpzmpcescrdpi2hycviro.py 
2025-12-04T10:04:38.9297882Z Compiled module path: /tmp/tmpwecqpcl_/4u/c4ubxuilxbe5vh3qpckkokcpynrjflb2jrikf2e24hqmcjfcjubz.py 2025-12-04T10:04:38.9297957Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9298002Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9298059Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9298160Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9298775Z inductor [('triton_bundler_save_kernel', 536), ('benchmarking.InductorBenchmarker.benchmark_gpu', 63), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 27), ('coordesc_tuning_bench', 18), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9298829Z graph_break [] 2025-12-04T10:04:38.9298877Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9298967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9299008Z Autotune Choices Stats: 2025-12-04T10:04:38.9299387Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1881", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.00571999978274107, "best_triton_pos": 0} 2025-12-04T10:04:38.9299431Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9299471Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9299518Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9299757Z triton_mm_1881 0.0057 ms 100.0% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9299996Z triton_mm_1884 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9300231Z triton_mm_1877 0.0062 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9300470Z triton_mm_1879 0.0066 ms 86.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9300706Z triton_mm_1885 0.0067 ms 85.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9300950Z triton_mm_1883 0.0067 ms 85.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9301198Z triton_mm_1888 0.0068 ms 84.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9301439Z triton_mm_1894 0.0069 ms 83.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, 
num_stages=2, num_warps=8 2025-12-04T10:04:38.9301678Z triton_mm_1889 0.0069 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9301912Z triton_mm_1880 0.0072 ms 79.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9302045Z SingleProcess AUTOTUNE benchmarking takes 0.2638 seconds and 0.4207 seconds precompiling for 36 choices 2025-12-04T10:04:38.9302184Z Compiled module path: /tmp/tmpkl8te86_/i6/ci6lwcsq7idvtacytbzzrzbnab56mzgboqfau3uibmxvapglkst2.py 2025-12-04T10:04:38.9302318Z Compiled module path: /tmp/tmpkl8te86_/nv/cnvdolvjoljfpkm7jf7fk5t3kxlfwv6p5clqhtoph5a5gwd5gwfa.py 2025-12-04T10:04:38.9302395Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9302437Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9302497Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9302607Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9303236Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 57), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 21), ('coordesc_tuning_bench', 12), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9303275Z graph_break [] 2025-12-04T10:04:38.9303326Z aten_mm_info [('aten.mm_256_256_256', 2)] 
2025-12-04T10:04:38.9303400Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9303445Z Autotune Choices Stats: 2025-12-04T10:04:38.9303817Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_1957", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005840000230818987, "best_triton_pos": 0} 2025-12-04T10:04:38.9303863Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9303903Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9303952Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9304191Z triton_mm_1957 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9304428Z triton_mm_1963 0.0060 ms 96.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9304674Z triton_mm_1951 0.0062 ms 94.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9304909Z triton_mm_1949 0.0062 ms 93.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9305158Z triton_mm_1962 0.0066 ms 88.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, 
waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9305394Z triton_mm_1955 0.0067 ms 87.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9305632Z triton_mm_1952 0.0067 ms 86.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9305873Z triton_mm_1960 0.0068 ms 85.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9306154Z triton_mm_1961 0.0070 ms 83.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9306410Z triton_mm_1966 0.0071 ms 82.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9306541Z SingleProcess AUTOTUNE benchmarking takes 0.2594 seconds and 0.3004 seconds precompiling for 36 choices 2025-12-04T10:04:38.9306697Z Compiled module path: /tmp/tmpn3oxb425/pd/cpdljtliztflcghm2dbquzesbnirwjr74omnt6m5joku2pxn7n4l.py 2025-12-04T10:04:38.9306831Z Compiled module path: /tmp/tmpn3oxb425/sd/csd5ige4hkh3n5nfat4a75lyk2j6k2acfrta2mnrtvfermqkekf5.py 2025-12-04T10:04:38.9306908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9306952Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9307013Z stats [('calls_captured', 4), ('unique_graphs', 2)] 
2025-12-04T10:04:38.9307114Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9307737Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 57), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 21), ('coordesc_tuning_bench', 12), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9307778Z graph_break [] 2025-12-04T10:04:38.9307825Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9307902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9307943Z Autotune Choices Stats: 2025-12-04T10:04:38.9308315Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2024", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005679999943822622, "best_triton_pos": 0} 2025-12-04T10:04:38.9308371Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9308416Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9308463Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9308703Z triton_mm_2024 0.0057 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9308952Z triton_mm_2025 0.0062 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, 
matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9309187Z triton_mm_2028 0.0062 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9309424Z triton_mm_2034 0.0066 ms 85.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9309662Z triton_mm_2035 0.0066 ms 85.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9309899Z triton_mm_2029 0.0067 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9310134Z triton_mm_2038 0.0068 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9310389Z triton_mm_2032 0.0070 ms 81.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9310630Z triton_mm_2027 0.0070 ms 80.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9310867Z triton_mm_2037 0.0074 ms 76.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, 
BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9310999Z SingleProcess AUTOTUNE benchmarking takes 0.2409 seconds and 0.2831 seconds precompiling for 36 choices 2025-12-04T10:04:38.9311136Z Compiled module path: /tmp/tmpupbxuwll/wf/cwfv5iomwne7ira2lwcn6me5rrocun36pkcb7hcqnr4iz44g4sxk.py 2025-12-04T10:04:38.9311277Z Compiled module path: /tmp/tmpupbxuwll/fd/cfdkx22iuyrmwf4wz6upt5ft52qczyejt7sustzr4yn4kuzj5ygy.py 2025-12-04T10:04:38.9311353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9311400Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9311457Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9311559Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9312178Z inductor [('triton_bundler_save_kernel', 472), ('benchmarking.InductorBenchmarker.benchmark_gpu', 55), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 19), ('coordesc_tuning_bench', 10), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9312230Z graph_break [] 2025-12-04T10:04:38.9312277Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9312352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9312394Z Autotune Choices Stats: 2025-12-04T10:04:38.9312782Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2097", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.0063599999994039536, "best_triton_pos": 0} 2025-12-04T10:04:38.9312827Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9312866Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9312914Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9313152Z triton_mm_2097 0.0064 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9313389Z triton_mm_2095 0.0065 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9313626Z triton_mm_2101 0.0065 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9313866Z triton_mm_2107 0.0068 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9314114Z triton_mm_2106 0.0068 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9314363Z triton_mm_2093 0.0068 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9314600Z triton_mm_2109 0.0069 ms 92.4% ACC_TYPE='tl.float32', 
ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9314837Z triton_mm_2110 0.0071 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9315077Z triton_mm_2108 0.0072 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9315313Z triton_mm_2103 0.0074 ms 85.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9315447Z SingleProcess AUTOTUNE benchmarking takes 0.2628 seconds and 0.2730 seconds precompiling for 36 choices 2025-12-04T10:04:38.9315584Z Compiled module path: /tmp/tmp7ybvedqm/4o/c4o4xsuveugtbd5ayuhtr33crgmsknfmd4wvyjd7rfzznncmiaen.py 2025-12-04T10:04:38.9315725Z Compiled module path: /tmp/tmp7ybvedqm/gd/cgdmwk44t7uabcneegntpu5h4sncag65vnvtyswucjx4zidxkrog.py 2025-12-04T10:04:38.9315800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9315856Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9315914Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9316066Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9316701Z inductor [('triton_bundler_save_kernel', 464), ('benchmarking.InductorBenchmarker.benchmark_gpu', 54), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 
36), ('benchmarking.InductorBenchmarker.benchmark', 18), ('coordesc_tuning_bench', 9), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9316741Z graph_break [] 2025-12-04T10:04:38.9316795Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9316870Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9316919Z Autotune Choices Stats: 2025-12-04T10:04:38.9317294Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2172", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006279999855905771, "best_triton_pos": 0} 2025-12-04T10:04:38.9317343Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9317383Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9317444Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9317685Z triton_mm_2172 0.0063 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9317943Z triton_mm_2168 0.0066 ms 95.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9318198Z triton_mm_2165 0.0068 ms 92.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9318433Z triton_mm_2164 0.0070 ms 90.2% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9318670Z triton_mm_2171 0.0070 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9318906Z triton_mm_2173 0.0070 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9319147Z triton_mm_2179 0.0070 ms 90.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9319380Z triton_mm_2166 0.0070 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9319614Z triton_mm_2167 0.0070 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9319854Z triton_mm_2169 0.0070 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9320006Z SingleProcess AUTOTUNE benchmarking takes 0.2716 seconds and 0.2836 seconds precompiling for 36 choices 2025-12-04T10:04:38.9320145Z Compiled module path: 
/tmp/tmpnn06v_ys/dt/cdtvrts5b6yekm3rnwmopth2frnwwakwpdfl5xs6g2frnonyp4ie.py 2025-12-04T10:04:38.9320278Z Compiled module path: /tmp/tmpnn06v_ys/ea/ceadmclxzhteoppmjag7nvbdcjmkkkgw23t25bxiofxogvujbuhs.py 2025-12-04T10:04:38.9320365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9320408Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9320467Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9320566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9321189Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9321232Z graph_break [] 2025-12-04T10:04:38.9321278Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9321355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9321396Z Autotune Choices Stats: 2025-12-04T10:04:38.9321784Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2237", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.0061599998734891415, "best_triton_pos": 0} 2025-12-04T10:04:38.9321837Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9321880Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9321926Z dtypes: torch.float16, torch.float16 
2025-12-04T10:04:38.9322173Z triton_mm_2237 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9322407Z triton_mm_2236 0.0064 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9322641Z triton_mm_2240 0.0065 ms 94.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9322878Z triton_mm_2245 0.0070 ms 88.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9323116Z triton_mm_2246 0.0075 ms 81.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9323352Z triton_mm_2252 0.0078 ms 78.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9323587Z triton_mm_2241 0.0079 ms 78.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9323833Z triton_mm_2248 0.0079 ms 77.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9324069Z triton_mm_2238 0.0080 ms 77.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9324323Z triton_mm_2251 0.0080 ms 76.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9324456Z SingleProcess AUTOTUNE benchmarking takes 0.2688 seconds and 0.2640 seconds precompiling for 36 choices 2025-12-04T10:04:38.9324598Z Compiled module path: /tmp/tmpr9mbe6lj/mq/cmqf76uxyuuzvl2yjw4qf5mrxv7rzu6bgzunmv4mtintcaeiwmgp.py 2025-12-04T10:04:38.9324738Z Compiled module path: /tmp/tmpr9mbe6lj/ww/cwwklwc6j6odl3luluxzprcysw6eoqz5iezsxsfjuj546el5s5ak.py 2025-12-04T10:04:38.9324814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9324865Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9324924Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9325029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9325657Z inductor [('triton_bundler_save_kernel', 448), ('benchmarking.InductorBenchmarker.benchmark_gpu', 52), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 16), ('coordesc_tuning_bench', 7), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9325711Z graph_break [] 
2025-12-04T10:04:38.9325758Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9325836Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9325881Z Autotune Choices Stats: 2025-12-04T10:04:38.9326311Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2317", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T10:04:38.9326355Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9326402Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9326450Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9326696Z triton_mm_2317 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9326935Z triton_mm_2311 0.0062 ms 96.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9327172Z triton_mm_2313 0.0066 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9327408Z triton_mm_2310 0.0066 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9327662Z triton_mm_2320 0.0068 ms 88.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, 
GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9327899Z triton_mm_2312 0.0069 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9328155Z triton_mm_2326 0.0070 ms 86.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9328395Z triton_mm_2316 0.0072 ms 83.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9328637Z triton_mm_2323 0.0072 ms 83.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9328871Z triton_mm_2315 0.0072 ms 83.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9329006Z SingleProcess AUTOTUNE benchmarking takes 0.2615 seconds and 0.2780 seconds precompiling for 36 choices 2025-12-04T10:04:38.9329147Z Compiled module path: /tmp/tmpz2i1m2et/u6/cu6ot52uve4cl5ogkz6biotjjaklfdohlcspilufjjxcu7ya5nnw.py 2025-12-04T10:04:38.9329291Z Compiled module path: /tmp/tmpz2i1m2et/re/crebrfazakxda4xde2hrbab3oghd3vj57o65vkyqockgyxo2mgys.py 2025-12-04T10:04:38.9329367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9329433Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9329493Z 
stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9329612Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9330234Z inductor [('triton_bundler_save_kernel', 464), ('benchmarking.InductorBenchmarker.benchmark_gpu', 54), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 18), ('coordesc_tuning_bench', 9), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9330273Z graph_break [] 2025-12-04T10:04:38.9330324Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9330400Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9330444Z Autotune Choices Stats: 2025-12-04T10:04:38.9330817Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2383", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005758999846875668, "best_triton_pos": 0} 2025-12-04T10:04:38.9330863Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9330903Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9330956Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9331192Z triton_mm_2383 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9331428Z triton_mm_2378 0.0060 ms 96.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9331673Z triton_mm_2381 0.0061 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9331921Z triton_mm_2379 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9332157Z triton_mm_2387 0.0062 ms 93.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9332399Z triton_mm_2377 0.0062 ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=32, BLOCK_N=16, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=2 2025-12-04T10:04:38.9332637Z triton_mm_2384 0.0064 ms 90.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9332872Z triton_mm_2376 0.0064 ms 89.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=16, BLOCK_N=16, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=8, num_stages=2, num_warps=1 2025-12-04T10:04:38.9333107Z triton_mm_2382 0.0065 ms 88.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9333360Z triton_mm_2388 0.0066 ms 87.8% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9333500Z SingleProcess AUTOTUNE benchmarking takes 0.2201 seconds and 0.2729 seconds precompiling for 36 choices 2025-12-04T10:04:38.9333638Z Compiled module path: /tmp/tmpl_j6av6b/64/c64cyvcbi26v5parqlwdbty3fu5fg6grlry6xugeomsbxeiihmi3.py 2025-12-04T10:04:38.9333771Z Compiled module path: /tmp/tmpl_j6av6b/cn/ccnq3ojwmcmo7hczekbmheijwug5rckfsrr2qajctvqfsa3ssor2.py 2025-12-04T10:04:38.9333850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9333893Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9333952Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9334050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9334671Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9334710Z graph_break [] 2025-12-04T10:04:38.9334758Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9334832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9334878Z Autotune Choices Stats: 2025-12-04T10:04:38.9335254Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2453", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, 
BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4", "best_time": 0.005880000069737434, "best_triton_pos": 0} 2025-12-04T10:04:38.9335308Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9335349Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9335395Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9335645Z triton_mm_2453 0.0059 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9335878Z triton_mm_2461 0.0059 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9336162Z triton_mm_2460 0.0060 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9336398Z triton_mm_2467 0.0061 ms 96.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9336638Z triton_mm_2459 0.0064 ms 92.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9336874Z triton_mm_2456 0.0068 ms 87.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9337120Z triton_mm_2457 0.0069 
ms 85.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9337371Z triton_mm_2464 0.0070 ms 84.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9337603Z triton_mm_2455 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9337843Z triton_mm_2470 0.0070 ms 84.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9337973Z SingleProcess AUTOTUNE benchmarking takes 0.2662 seconds and 0.2750 seconds precompiling for 36 choices 2025-12-04T10:04:38.9338116Z Compiled module path: /tmp/tmpt4ls5vhy/rz/crzftzk7qcpxx4rj22eaygqwzg2iin55y4pljlnweaiicdvhrhbc.py 2025-12-04T10:04:38.9338251Z Compiled module path: /tmp/tmpt4ls5vhy/qn/cqnk5pd6cp65b2byl2qoz4d2thgxk5yznkcnwjpytdmsyx2wk7fs.py 2025-12-04T10:04:38.9338328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9338376Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9338434Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9338537Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9339160Z inductor [('triton_bundler_save_kernel', 472), ('benchmarking.InductorBenchmarker.benchmark_gpu', 55), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), 
('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 19), ('coordesc_tuning_bench', 10), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9339223Z graph_break [] 2025-12-04T10:04:38.9339271Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9339347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9339386Z Autotune Choices Stats: 2025-12-04T10:04:38.9339775Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2528", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.005799000151455402, "best_triton_pos": 0} 2025-12-04T10:04:38.9339821Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9339865Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9339912Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9340152Z triton_mm_2528 0.0058 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9340389Z triton_mm_2539 0.0060 ms 97.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9340626Z triton_mm_2532 0.0062 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9340865Z triton_mm_2529 0.0062 
ms 92.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9341110Z triton_mm_2526 0.0065 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9341359Z triton_mm_2533 0.0066 ms 87.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9341593Z triton_mm_2527 0.0066 ms 87.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9341834Z triton_mm_2538 0.0067 ms 86.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9342071Z triton_mm_2525 0.0068 ms 85.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9342306Z triton_mm_2541 0.0073 ms 79.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9342440Z SingleProcess AUTOTUNE benchmarking takes 0.2610 seconds and 0.2669 seconds precompiling for 36 choices 2025-12-04T10:04:38.9342578Z Compiled module path: 
/tmp/tmpd5jk6yhp/gw/cgwjnigzmj2ezgvdoy57esrmwfvq7v6liqbcza53vvjc3i3zmgxu.py 2025-12-04T10:04:38.9342719Z Compiled module path: /tmp/tmpd5jk6yhp/yf/cyfmnfk4wuy5n6xwnau5dcwxdhnhvk4nyp5avwih6qcexpk5ysyx.py 2025-12-04T10:04:38.9342814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9342860Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9342920Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9343021Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9343648Z inductor [('triton_bundler_save_kernel', 440), ('benchmarking.InductorBenchmarker.benchmark_gpu', 51), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 15), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9343688Z graph_break [] 2025-12-04T10:04:38.9343737Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9343814Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9343856Z Autotune Choices Stats: 2025-12-04T10:04:38.9344231Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2605", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006560000125318766, "best_triton_pos": 0} 2025-12-04T10:04:38.9344279Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9344320Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9344368Z dtypes: torch.float16, torch.float16 
2025-12-04T10:04:38.9344607Z triton_mm_2605 0.0066 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9344860Z triton_mm_2597 0.0068 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9345107Z triton_mm_2600 0.0068 ms 95.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9345341Z triton_mm_2599 0.0069 ms 95.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9345574Z triton_mm_2598 0.0070 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9345809Z triton_mm_2603 0.0070 ms 93.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9346083Z triton_mm_2601 0.0070 ms 93.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9346318Z triton_mm_2604 0.0071 ms 92.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9346554Z triton_mm_2611 0.0072 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9346813Z triton_mm_2608 0.0072 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9346944Z SingleProcess AUTOTUNE benchmarking takes 0.2705 seconds and 0.2758 seconds precompiling for 36 choices 2025-12-04T10:04:38.9347088Z Compiled module path: /tmp/tmpsmv39ob2/cb/ccbmpgeennbwoyjjiix3lca3lbb6ut5oirfby45ki3kvgostp24v.py 2025-12-04T10:04:38.9347249Z Compiled module path: /tmp/tmpsmv39ob2/tx/ctxqtmrxixapgqvsvu263l3joyexxzywhhbbersba5o5t5esny2s.py 2025-12-04T10:04:38.9347329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9347373Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9347431Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9347531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9348148Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9348188Z graph_break [] 
2025-12-04T10:04:38.9348237Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9348310Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9348353Z Autotune Choices Stats: 2025-12-04T10:04:38.9348739Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2675", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.006479999981820583, "best_triton_pos": 0} 2025-12-04T10:04:38.9348798Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9348839Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9348886Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9349125Z triton_mm_2675 0.0065 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9349363Z triton_mm_2669 0.0066 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9349596Z triton_mm_2670 0.0067 ms 97.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9349835Z triton_mm_2676 0.0068 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9350076Z triton_mm_2671 0.0070 ms 93.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, 
GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9350311Z triton_mm_2672 0.0070 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9350558Z triton_mm_2681 0.0070 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9350805Z triton_mm_2673 0.0070 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9351048Z triton_mm_2682 0.0070 ms 92.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9351282Z triton_mm_2677 0.0070 ms 92.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9351412Z SingleProcess AUTOTUNE benchmarking takes 0.2665 seconds and 0.2768 seconds precompiling for 36 choices 2025-12-04T10:04:38.9351551Z Compiled module path: /tmp/tmp0b6sms4g/7b/c7boz5dftlm4qa2padkqlfrvb575kvd4ff4orcjbjhztnti6mla7.py 2025-12-04T10:04:38.9351690Z Compiled module path: /tmp/tmp0b6sms4g/m5/cm5s5c2piphijc4jgvoleaxomoby7vgxgufnkb7ud4ayn3zeiotq.py 2025-12-04T10:04:38.9351768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9351810Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9351869Z 
stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9351968Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9352596Z inductor [('triton_bundler_save_kernel', 480), ('benchmarking.InductorBenchmarker.benchmark_gpu', 56), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 20), ('coordesc_tuning_bench', 11), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9352654Z graph_break [] 2025-12-04T10:04:38.9352701Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9352777Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9352819Z Autotune Choices Stats: 2025-12-04T10:04:38.9353193Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2742", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.005960000213235617, "best_triton_pos": 0} 2025-12-04T10:04:38.9353235Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9353278Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9353324Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9353563Z triton_mm_2742 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9353798Z triton_mm_2745 0.0063 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, 
EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9354031Z triton_mm_2744 0.0065 ms 91.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9354265Z triton_mm_2743 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9354510Z triton_mm_2749 0.0067 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9354759Z triton_mm_2747 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9354995Z triton_mm_2755 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9355232Z triton_mm_2748 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9355470Z triton_mm_2752 0.0068 ms 87.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9355709Z triton_mm_2758 0.0071 ms 84.2% 
ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9355840Z SingleProcess AUTOTUNE benchmarking takes 0.2630 seconds and 0.2707 seconds precompiling for 36 choices 2025-12-04T10:04:38.9356016Z Compiled module path: /tmp/tmpsn5d6pwl/zp/czpbxyxowaqioifrysy6f3wqk22ivqgaf7boi7wwdkvs2akecsbo.py 2025-12-04T10:04:38.9356156Z Compiled module path: /tmp/tmpsn5d6pwl/2g/c2glqzegdi4vpocysemtvrvmdljfqep2sz7fsnvjwuvqlbfixxbu.py 2025-12-04T10:04:38.9356255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9356313Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9356370Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9356474Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9357095Z inductor [('triton_bundler_save_kernel', 432), ('benchmarking.InductorBenchmarker.benchmark_gpu', 50), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 14), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9357136Z graph_break [] 2025-12-04T10:04:38.9357183Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9357261Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9357302Z Autotune Choices Stats: 2025-12-04T10:04:38.9357680Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2814", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.006120000034570694, "best_triton_pos": 0} 2025-12-04T10:04:38.9357729Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9357771Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9365265Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9365524Z triton_mm_2814 0.0061 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9365809Z triton_mm_2827 0.0064 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9366093Z triton_mm_2824 0.0067 ms 91.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9366350Z triton_mm_2821 0.0069 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9366584Z triton_mm_2817 0.0069 ms 88.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9366824Z triton_mm_2813 0.0072 ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9367064Z triton_mm_2825 0.0072 
ms 85.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9367300Z triton_mm_2826 0.0074 ms 82.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9367547Z triton_mm_2815 0.0074 ms 82.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9367781Z triton_mm_2819 0.0076 ms 80.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9367930Z SingleProcess AUTOTUNE benchmarking takes 0.2720 seconds and 0.2795 seconds precompiling for 36 choices 2025-12-04T10:04:38.9368074Z Compiled module path: /tmp/tmp66buw1v0/4c/c4cxcogaoflip7ng3tanrcilkqbzqxqvjrl2qq24tjgyzxdq73s5.py 2025-12-04T10:04:38.9368216Z Compiled module path: /tmp/tmp66buw1v0/os/costhyfbzdejzzdwwtl6ks4t4stsggd2lsn6qrpryuipqjgfvejq.py 2025-12-04T10:04:38.9368294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9368341Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9368400Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9368506Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9369128Z inductor [('triton_bundler_save_kernel', 480), ('benchmarking.InductorBenchmarker.benchmark_gpu', 56), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), 
('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 20), ('coordesc_tuning_bench', 11), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9369169Z graph_break [] 2025-12-04T10:04:38.9369219Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9369296Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9369339Z Autotune Choices Stats: 2025-12-04T10:04:38.9369714Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2886", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.006000000052154064, "best_triton_pos": 0} 2025-12-04T10:04:38.9369780Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9369820Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9369869Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9370118Z triton_mm_2886 0.0060 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9370357Z triton_mm_2883 0.0062 ms 97.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9370592Z triton_mm_2893 0.0064 ms 94.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9370830Z triton_mm_2897 0.0066 
ms 91.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9371066Z triton_mm_2884 0.0066 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9371301Z triton_mm_2885 0.0066 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9371547Z triton_mm_2888 0.0066 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9371791Z triton_mm_2889 0.0066 ms 90.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9372025Z triton_mm_2887 0.0067 ms 89.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9372259Z triton_mm_2899 0.0068 ms 88.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9372392Z SingleProcess AUTOTUNE benchmarking takes 0.2354 seconds and 0.2538 seconds precompiling for 36 choices 2025-12-04T10:04:38.9372528Z Compiled module path: 
/tmp/tmpzm_slxj2/e7/ce7rv2m6xkjqu3iaf7qshd6ggfukudjyb54u6htbnphxr6w2zzra.py 2025-12-04T10:04:38.9372661Z Compiled module path: /tmp/tmpzm_slxj2/4e/c4ehdxwpzmlagenw4znjcd3dd6yejr7ttnebumf7tf2pmo5cxzmt.py 2025-12-04T10:04:38.9372737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9372779Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9372840Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9372939Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9373561Z inductor [('triton_bundler_save_kernel', 480), ('benchmarking.InductorBenchmarker.benchmark_gpu', 56), ('async_compile_cache_miss', 44), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 20), ('coordesc_tuning_bench', 11), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1)] 2025-12-04T10:04:38.9373615Z graph_break [] 2025-12-04T10:04:38.9373662Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9373737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9373778Z Autotune Choices Stats: 2025-12-04T10:04:38.9374163Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_2964", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8", "best_time": 0.006159000098705292, "best_triton_pos": 0} 2025-12-04T10:04:38.9374207Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9374250Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9374298Z dtypes: torch.float16, torch.float16 
2025-12-04T10:04:38.9374541Z triton_mm_2964 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=16, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9374777Z triton_mm_2959 0.0063 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9375010Z triton_mm_2960 0.0064 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9375255Z triton_mm_2968 0.0064 ms 96.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9375501Z triton_mm_2957 0.0065 ms 94.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9375738Z triton_mm_2965 0.0066 ms 93.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9376007Z triton_mm_2958 0.0067 ms 91.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9376244Z triton_mm_2961 0.0068 ms 90.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, 
USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9376479Z triton_mm_2970 0.0069 ms 89.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9376712Z triton_mm_2963 0.0069 ms 89.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9376844Z SingleProcess AUTOTUNE benchmarking takes 0.2648 seconds and 0.2754 seconds precompiling for 36 choices 2025-12-04T10:04:38.9376980Z Compiled module path: /tmp/tmp8i5jvn9l/ae/caer4mqz2o2sdiy2lunmqv7jh67inqjastuijzei2tdxc6ytahzr.py 2025-12-04T10:04:38.9377136Z Compiled module path: /tmp/tmp8i5jvn9l/f7/cf7zutpg36lvs4vdclkl6vqj6bz32k7bjkqq234zxyfwec2azj6y.py 2025-12-04T10:04:38.9377210Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:04:38.9377256Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:04:38.9377313Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T10:04:38.9377414Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:04:38.9378082Z inductor [('triton_bundler_save_kernel', 488), ('benchmarking.InductorBenchmarker.benchmark_gpu', 62), ('async_compile_cache_miss', 47), ('generated_module_cache_miss', 36), ('select_algorithm_num_precompiles', 36), ('generated_module_cache_hit', 36), ('benchmarking.InductorBenchmarker.benchmark', 26), ('coordesc_tuning_bench', 12), ('fxgraph_cache_miss', 2), ('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('triton_bundler_save_static_autotuner', 2), ('select_algorithm_precompile', 1), ('select_algorithm_autotune', 1), ('async_compile_cache_hit', 1)] 
2025-12-04T10:04:38.9378124Z graph_break [] 2025-12-04T10:04:38.9378170Z aten_mm_info [('aten.mm_256_256_256', 2)] 2025-12-04T10:04:38.9378247Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:04:38.9378287Z Autotune Choices Stats: 2025-12-04T10:04:38.9378659Z {"num_choices": 36, "num_triton_choices": 36, "best_kernel": "triton_mm_3033", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4", "best_time": 0.006240000016987324, "best_triton_pos": 0} 2025-12-04T10:04:38.9378702Z AUTOTUNE mm(256x256, 256x256) 2025-12-04T10:04:38.9378743Z strides: [256, 1], [1, 256] 2025-12-04T10:04:38.9378789Z dtypes: torch.float16, torch.float16 2025-12-04T10:04:38.9379045Z triton_mm_3033 0.0062 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9379279Z triton_mm_3031 0.0063 ms 98.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9379524Z triton_mm_3030 0.0064 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9379760Z triton_mm_3037 0.0065 ms 96.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=256, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9379992Z triton_mm_3032 0.0066 ms 95.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, 
BLOCK_M=64, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9380227Z triton_mm_3042 0.0067 ms 93.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=128, BLOCK_N=32, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=4 2025-12-04T10:04:38.9380463Z triton_mm_3040 0.0068 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9380700Z triton_mm_3046 0.0070 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=128, BLOCK_N=64, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9380939Z triton_mm_3041 0.0070 ms 89.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=128, EVEN_K=True, GROUP_M=4, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=0, num_stages=2, num_warps=8 2025-12-04T10:04:38.9381184Z triton_mm_3029 0.0072 ms 87.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=16, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, kpack=2, matrix_instr_nonkdim=16, waves_per_eu=2, num_stages=2, num_warps=4 2025-12-04T10:04:38.9381315Z SingleProcess AUTOTUNE benchmarking takes 0.2654 seconds and 0.2749 seconds precompiling for 36 choices 2025-12-04T10:04:38.9381459Z Compiled module path: /tmp/tmpn_rptzwz/ax/cax6bp7girhycgspdrux552hbxqycvwfd5x7hoqx3a3twvtcdrv5.py 2025-12-04T10:04:38.9381594Z Compiled module path: /tmp/tmpn_rptzwz/yq/cyqy6uo63rufxl2ldg7nzh7mbgxxzheprqzrf7bv5i5sofzd5na2.py 2025-12-04T10:04:38.9381832Z - generated xml file: 
/var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-32f423bfb0824e63.xml - 2025-12-04T10:04:38.9381899Z =========================== short test summary info ============================ 2025-12-04T10:04:38.9382187Z FAILED [2.5344s] inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code - RuntimeError: Expected to find "triton_tem_fused_addmm_relu_t_0" but did not find it 2025-12-04T10:04:38.9382227Z Searched string: 2025-12-04T10:04:38.9382276Z with torch.cuda._DeviceGuard(0): 2025-12-04T10:04:38.9382320Z torch.cuda.set_device(0) 2025-12-04T10:04:38.9382394Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float16) 2025-12-04T10:04:38.9382492Z # Topologically Sorted Source Nodes: [a], Original ATen: [aten.t, aten.addmm] 2025-12-04T10:04:38.9382540Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9382629Z triton_tem_fused_addmm_t_0.run(arg2_1, arg0_1, buf0, 64, 1, 1, stream=stream0) 2025-12-04T10:04:38.9382671Z del arg0_1 2025-12-04T10:04:38.9382709Z del arg2_1 2025-12-04T10:04:38.9382768Z buf1 = buf0; del buf0 # reuse 2025-12-04T10:04:38.9382873Z # Topologically Sorted Source Nodes: [a, relu], Original ATen: [aten.addmm, aten.relu] 2025-12-04T10:04:38.9382939Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9383022Z triton_poi_fused_addmm_relu_1.run(buf1, arg1_1, 65536, stream=stream0) 2025-12-04T10:04:38.9383061Z del arg1_1 2025-12-04T10:04:38.9383100Z return (buf1, ) 2025-12-04T10:04:38.9383103Z 2025-12-04T10:04:38.9383151Z runner = Runner(partitions=[]) 2025-12-04T10:04:38.9383191Z call = runner.call 2025-12-04T10:04:38.9383261Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T10:04:38.9383263Z 2025-12-04T10:04:38.9383265Z 2025-12-04T10:04:38.9383324Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T10:04:38.9383382Z from torch._dynamo.testing import rand_strided 2025-12-04T10:04:38.9383446Z from torch._inductor.utils 
import print_performance 2025-12-04T10:04:38.9383534Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9383610Z arg1_1 = rand_strided((256, ), (1, ), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9383691Z arg2_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9383742Z fn = lambda: call([arg0_1, arg1_1, arg2_1]) 2025-12-04T10:04:38.9383814Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T10:04:38.9383817Z 2025-12-04T10:04:38.9383818Z 2025-12-04T10:04:38.9383861Z if __name__ == "__main__": 2025-12-04T10:04:38.9383948Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T10:04:38.9384015Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T10:04:38.9384067Z From CHECK: triton_tem_fused_addmm_relu_t_0 2025-12-04T10:04:38.9384069Z 2025-12-04T10:04:38.9384071Z 2025-12-04T10:04:38.9384144Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:38.9384347Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code 2025-12-04T10:04:38.9384351Z 2025-12-04T10:04:38.9384439Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:38.9384709Z FAILED [2.2923s] inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code - RuntimeError: Expected to find "triton_tem_fused_addmm_relu_t_0" but did not find it 2025-12-04T10:04:38.9384747Z Searched string: 2025-12-04T10:04:38.9384813Z with torch.cuda._DeviceGuard(0): 2025-12-04T10:04:38.9384858Z torch.cuda.set_device(0) 2025-12-04T10:04:38.9384928Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float16) 2025-12-04T10:04:38.9385022Z # Topologically Sorted Source Nodes: [a], Original ATen: [aten.t, aten.addmm] 2025-12-04T10:04:38.9385066Z stream0 = get_raw_stream(0) 
2025-12-04T10:04:38.9385155Z triton_tem_fused_addmm_t_0.run(arg2_1, arg0_1, buf0, 32, 1, 1, stream=stream0) 2025-12-04T10:04:38.9385194Z del arg0_1 2025-12-04T10:04:38.9385232Z del arg2_1 2025-12-04T10:04:38.9385276Z buf1 = buf0; del buf0 # reuse 2025-12-04T10:04:38.9385381Z # Topologically Sorted Source Nodes: [a, relu], Original ATen: [aten.addmm, aten.relu] 2025-12-04T10:04:38.9385425Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9385507Z triton_poi_fused_addmm_relu_1.run(buf1, arg1_1, 65536, stream=stream0) 2025-12-04T10:04:38.9385544Z del arg1_1 2025-12-04T10:04:38.9385585Z return (buf1, ) 2025-12-04T10:04:38.9385587Z 2025-12-04T10:04:38.9385633Z runner = Runner(partitions=[]) 2025-12-04T10:04:38.9385672Z call = runner.call 2025-12-04T10:04:38.9385739Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T10:04:38.9385741Z 2025-12-04T10:04:38.9385743Z 2025-12-04T10:04:38.9385805Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T10:04:38.9385872Z from torch._dynamo.testing import rand_strided 2025-12-04T10:04:38.9385982Z from torch._inductor.utils import print_performance 2025-12-04T10:04:38.9386088Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9386166Z arg1_1 = rand_strided((256, ), (1, ), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9386244Z arg2_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9386295Z fn = lambda: call([arg0_1, arg1_1, arg2_1]) 2025-12-04T10:04:38.9386365Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T10:04:38.9386367Z 2025-12-04T10:04:38.9386369Z 2025-12-04T10:04:38.9386410Z if __name__ == "__main__": 2025-12-04T10:04:38.9386492Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T10:04:38.9386560Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T10:04:38.9386613Z From CHECK: triton_tem_fused_addmm_relu_t_0 
2025-12-04T10:04:38.9386616Z 2025-12-04T10:04:38.9386618Z 2025-12-04T10:04:38.9386694Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:38.9386878Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code 2025-12-04T10:04:38.9386881Z 2025-12-04T10:04:38.9386969Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:38.9387237Z FAILED [2.1570s] inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code - RuntimeError: Expected to find "triton_tem_fused_addmm_relu_t_0" but did not find it 2025-12-04T10:04:38.9387276Z Searched string: 2025-12-04T10:04:38.9387325Z with torch.cuda._DeviceGuard(0): 2025-12-04T10:04:38.9387370Z torch.cuda.set_device(0) 2025-12-04T10:04:38.9387438Z buf0 = empty_strided_cuda((256, 256), (256, 1), torch.float16) 2025-12-04T10:04:38.9387551Z # Topologically Sorted Source Nodes: [a], Original ATen: [aten.t, aten.addmm] 2025-12-04T10:04:38.9387595Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9387682Z triton_tem_fused_addmm_t_0.run(arg2_1, arg0_1, buf0, 32, 1, 1, stream=stream0) 2025-12-04T10:04:38.9387718Z del arg0_1 2025-12-04T10:04:38.9387757Z del arg2_1 2025-12-04T10:04:38.9387801Z buf1 = buf0; del buf0 # reuse 2025-12-04T10:04:38.9387919Z # Topologically Sorted Source Nodes: [a, relu], Original ATen: [aten.addmm, aten.relu] 2025-12-04T10:04:38.9387963Z stream0 = get_raw_stream(0) 2025-12-04T10:04:38.9388043Z triton_poi_fused_addmm_relu_1.run(buf1, arg1_1, 65536, stream=stream0) 2025-12-04T10:04:38.9388080Z del arg1_1 2025-12-04T10:04:38.9388119Z return (buf1, ) 2025-12-04T10:04:38.9388121Z 2025-12-04T10:04:38.9388167Z runner = Runner(partitions=[]) 2025-12-04T10:04:38.9388204Z call = runner.call 2025-12-04T10:04:38.9388270Z recursively_apply_fns = runner.recursively_apply_fns 2025-12-04T10:04:38.9388274Z 2025-12-04T10:04:38.9388277Z 
2025-12-04T10:04:38.9388336Z def benchmark_compiled_module(times=10, repeat=10): 2025-12-04T10:04:38.9388391Z from torch._dynamo.testing import rand_strided 2025-12-04T10:04:38.9388454Z from torch._inductor.utils import print_performance 2025-12-04T10:04:38.9388537Z arg0_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9388612Z arg1_1 = rand_strided((256, ), (1, ), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9388690Z arg2_1 = rand_strided((256, 256), (256, 1), device='cuda:0', dtype=torch.float16) 2025-12-04T10:04:38.9388740Z fn = lambda: call([arg0_1, arg1_1, arg2_1]) 2025-12-04T10:04:38.9388811Z return print_performance(fn, times=times, repeat=repeat) 2025-12-04T10:04:38.9388813Z 2025-12-04T10:04:38.9388814Z 2025-12-04T10:04:38.9388856Z if __name__ == "__main__": 2025-12-04T10:04:38.9388957Z from torch._inductor.wrapper_benchmark import compiled_module_main 2025-12-04T10:04:38.9389023Z compiled_module_main('None', benchmark_compiled_module) 2025-12-04T10:04:38.9389089Z From CHECK: triton_tem_fused_addmm_relu_t_0 2025-12-04T10:04:38.9389091Z 2025-12-04T10:04:38.9389093Z 2025-12-04T10:04:38.9389165Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:38.9389348Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkMultiTemplateFusionGpuTest.test_equivalent_template_code 2025-12-04T10:04:38.9389351Z 2025-12-04T10:04:38.9389436Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:38.9389504Z =================== 3 failed, 97 passed in 237.15s (0:03:57) =================== 2025-12-04T10:04:38.9389506Z 2025-12-04T10:04:38.9389687Z FINISHED PRINTING LOG FILE of inductor/test_benchmark_fusion 1/1 (test/test-reports/inductor.test_benchmark_fusion_1.1_ea0e2b7da1ec2de3_.log) 2025-12-04T10:04:38.9389692Z 2025-12-04T10:04:38.9389814Z Finished inductor/test_benchmark_fusion 1/1 ... 
[2025-12-04 10:04:38.844019][5636699.350418871], took 4.10min 2025-12-04T10:04:38.9390057Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:04:38.9390164Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:04:38.9390260Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T10:04:38.9390312Z Uploading artifacts took 0.00 seconds 2025-12-04T10:04:38.9390364Z inductor/test_benchmark_fusion 1/1 failed! 2025-12-04T10:04:38.9390452Z Running export/test_serdes 1/1 ... [2025-12-04 10:04:38.932463][5636699.438860356] 2025-12-04T10:04:38.9390501Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:04:38.9390857Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_serdes.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:38.932786] 2025-12-04T10:04:45.2553325Z 2025-12-04T10:04:45.2554532Z export/test_serdes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_serdes_1.1_12d6d8d23c034c5c_.log 2025-12-04T10:04:45.2555477Z Running 0 items in this shard: 2025-12-04T10:04:45.2555742Z 2025-12-04T10:04:45.2557037Z Finished export/test_serdes 1/1 ... [2025-12-04 10:04:45.254987][5636705.761387988], took 0.11min 2025-12-04T10:04:45.2562768Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:04:45.3430001Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:04:45.3437250Z Running inductor/test_combo_kernels 1/1 ... 
[2025-12-04 10:04:45.343297][5636705.849695049] 2025-12-04T10:04:45.3437998Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:04:45.3439549Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_combo_kernels.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:45.343625] 2025-12-04T10:04:50.8086533Z 2025-12-04T10:04:50.8087963Z inductor/test_combo_kernels 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_combo_kernels_1.1_f996d78c775b6c8b_.log 2025-12-04T10:04:50.8088978Z Running 0 items in this shard: 2025-12-04T10:04:50.8089231Z 2025-12-04T10:04:50.8089641Z Finished inductor/test_combo_kernels 1/1 ... [2025-12-04 10:04:50.808296][5636711.314699312], took 0.09min 2025-12-04T10:04:50.8092774Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:04:50.8955358Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:04:50.8961761Z Running inductor/test_control_deps 1/1 ... [2025-12-04 10:04:50.895946][5636711.402343477] 2025-12-04T10:04:50.8962401Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:04:50.8965065Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_control_deps.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:04:50.896271] 2025-12-04T10:04:56.1066184Z 2025-12-04T10:04:56.1067476Z inductor/test_control_deps 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_control_deps_1.1_80c5da9d7fcfc418_.log 2025-12-04T10:04:56.1068544Z Running 0 items in this shard: 2025-12-04T10:04:56.1068808Z 2025-12-04T10:04:56.1069221Z Finished inductor/test_control_deps 1/1 ... [2025-12-04 10:04:56.106322][5636716.612724659], took 0.09min 2025-12-04T10:04:56.1074116Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:04:56.1941107Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:04:56.1947711Z Running inductor/test_compiled_optimizers 2/2 ... [2025-12-04 10:04:56.194530][5636716.700928675] 2025-12-04T10:04:56.1948400Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:04:56.1953488Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_compiled_optimizers.py', '--shard-id=2', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:56.194853] 2025-12-04T10:05:03.2080085Z 2025-12-04T10:05:03.2081485Z inductor/test_compiled_optimizers 2/2 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_compiled_optimizers_2.2_0b23a718f655146e_.log 2025-12-04T10:05:03.2082607Z Running 0 items in this shard: 2025-12-04T10:05:03.2082869Z 2025-12-04T10:05:03.2083310Z Finished inductor/test_compiled_optimizers 2/2 ... 
[2025-12-04 10:05:03.207657][5636723.714059233], took 0.12min 2025-12-04T10:05:03.2088629Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:03.2950415Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:03.2958472Z Running dynamo/test_unittest 1/1 ... [2025-12-04 10:05:03.295547][5636723.801945297] 2025-12-04T10:05:03.2959119Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:03.2963326Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_unittest.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:03.295862] 2025-12-04T10:05:05.4504024Z 2025-12-04T10:05:05.4505238Z dynamo/test_unittest 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_unittest_1.1_300216f34233090c_.log 2025-12-04T10:05:05.4506417Z Running 0 items in this shard: 2025-12-04T10:05:05.4506674Z 2025-12-04T10:05:05.4507125Z Finished dynamo/test_unittest 1/1 ... [2025-12-04 10:05:05.450065][5636725.956465388], took 0.04min 2025-12-04T10:05:05.4512359Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:05.5377281Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:05.5385240Z Running dynamo/test_streams 1/1 ... 
[2025-12-04 10:05:05.538259][5636726.044657557] 2025-12-04T10:05:05.5385561Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:05.5387308Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_streams.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:05.538559] 2025-12-04T10:05:07.8638994Z 2025-12-04T10:05:07.8640293Z dynamo/test_streams 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_streams_1.1_15376daf99bb2b47_.log 2025-12-04T10:05:07.8641260Z Running 0 items in this shard: 2025-12-04T10:05:07.8641527Z 2025-12-04T10:05:07.8641887Z Finished dynamo/test_streams 1/1 ... [2025-12-04 10:05:07.863544][5636728.369944074], took 0.04min 2025-12-04T10:05:07.8653215Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:07.9519024Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:07.9525423Z Running inductor/test_unbacked_symints 1/1 ... [2025-12-04 10:05:07.952277][5636728.458675012] 2025-12-04T10:05:07.9526311Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:07.9528138Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_unbacked_symints.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:05:07.952581] 2025-12-04T10:05:13.8207859Z 2025-12-04T10:05:13.8209785Z inductor/test_unbacked_symints 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_unbacked_symints_1.1_9e00f2859cf9ea7c_.log 2025-12-04T10:05:13.8211440Z Running 0 items in this shard: 2025-12-04T10:05:13.8211708Z 2025-12-04T10:05:13.8212129Z Finished inductor/test_unbacked_symints 1/1 ... [2025-12-04 10:05:13.820464][5636734.326863264], took 0.10min 2025-12-04T10:05:13.8216703Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:13.9082748Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:13.9089521Z Running inductor/test_mix_order_reduction 1/1 ... [2025-12-04 10:05:13.908672][5636734.415069585] 2025-12-04T10:05:13.9090214Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:13.9091997Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_mix_order_reduction.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:05:13.908988] 2025-12-04T10:05:37.5729372Z 2025-12-04T10:05:37.5730470Z PRINTING LOG FILE of inductor/test_mix_order_reduction 1/1 (test/test-reports/inductor.test_mix_order_reduction_1.1_7f4938ab0968cfe4_.log) 2025-12-04T10:05:37.5731779Z Test results will be stored in test-reports/python-pytest/inductor.test_mix_order_reduction/inductor.test_mix_order_reduction-181fc8d92d9e2229.xml 2025-12-04T10:05:37.5732750Z ============================= test session starts ============================== 2025-12-04T10:05:37.5733493Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T10:05:37.5734130Z cachedir: .pytest_cache 2025-12-04T10:05:37.5734925Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T10:05:37.5735719Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T10:05:37.5736242Z configfile: pytest.ini 2025-12-04T10:05:37.5737655Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T10:05:37.5738565Z collecting ... 
collected 380 items 2025-12-04T10:05:37.5739039Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T10:05:37.5760838Z Running 50 items in this shard: test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, 
test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, 
test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction, test/inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction 2025-12-04T10:05:37.5782463Z 2025-12-04T10:05:37.5782934Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [1.0934s] [ 2%] 2025-12-04T10:05:37.5783963Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2084s] [ 2%] 2025-12-04T10:05:37.5784973Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2067s] [ 2%] 2025-12-04T10:05:37.5786110Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2055s] [ 2%] 2025-12-04T10:05:37.5787128Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2093s] [ 2%] 2025-12-04T10:05:37.5788130Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3329s] [ 2%] 2025-12-04T10:05:37.5789165Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3450s] [ 2%] 2025-12-04T10:05:37.5790228Z 
inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3533s] [ 2%] 2025-12-04T10:05:37.5791240Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3602s] [ 2%] 2025-12-04T10:05:37.5792244Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3681s] [ 2%] 2025-12-04T10:05:37.5793247Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3612s] [ 2%] 2025-12-04T10:05:37.5794318Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3650s] [ 2%] 2025-12-04T10:05:37.5795334Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3894s] [ 2%] 2025-12-04T10:05:37.5796412Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3626s] [ 2%] 2025-12-04T10:05:37.5797441Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3664s] [ 2%] 2025-12-04T10:05:37.5798449Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.6296s] [ 2%] 2025-12-04T10:05:37.5799454Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3363s] [ 2%] 2025-12-04T10:05:37.5800472Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3886s] [ 2%] 2025-12-04T10:05:37.5801493Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3734s] [ 2%] 2025-12-04T10:05:37.5802498Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3848s] [ 2%] 2025-12-04T10:05:37.5803501Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3846s] [ 2%] 2025-12-04T10:05:37.5804507Z 
inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3746s] [ 2%] 2025-12-04T10:05:37.5805565Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3845s] [ 2%] 2025-12-04T10:05:37.5806683Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3739s] [ 2%] 2025-12-04T10:05:37.5807683Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3478s] [ 2%] 2025-12-04T10:05:37.5808688Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.4126s] [ 2%] 2025-12-04T10:05:37.5809693Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.4053s] [ 2%] 2025-12-04T10:05:37.5810699Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3605s] [ 2%] 2025-12-04T10:05:37.5811696Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3353s] [ 2%] 2025-12-04T10:05:37.5812699Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3361s] [ 2%] 2025-12-04T10:05:37.5813700Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3568s] [ 2%] 2025-12-04T10:05:37.5814710Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3804s] [ 2%] 2025-12-04T10:05:37.5815709Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3980s] [ 2%] 2025-12-04T10:05:37.5816787Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3489s] [ 2%] 2025-12-04T10:05:37.5817789Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3634s] [ 2%] 2025-12-04T10:05:37.5818789Z 
inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3195s] [ 2%] 2025-12-04T10:05:37.5819845Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.3517s] [ 2%] 2025-12-04T10:05:37.5820894Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2083s] [ 2%] 2025-12-04T10:05:37.5821896Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2091s] [ 2%] 2025-12-04T10:05:37.5822901Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2022s] [ 2%] 2025-12-04T10:05:37.5823961Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2046s] [ 2%] 2025-12-04T10:05:37.5824965Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2048s] [ 2%] 2025-12-04T10:05:37.5826044Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2004s] [ 2%] 2025-12-04T10:05:37.5829249Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2109s] [ 2%] 2025-12-04T10:05:37.5830257Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2071s] [ 2%] 2025-12-04T10:05:37.5831257Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2015s] [ 2%] 2025-12-04T10:05:37.5832264Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2228s] [ 2%] 2025-12-04T10:05:37.5833273Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2084s] [ 2%] 2025-12-04T10:05:37.5834270Z inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2044s] [ 2%] 2025-12-04T10:05:37.5835266Z 
inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction FAILED [0.2017s] [ 2%] 2025-12-04T10:05:37.5835838Z 2025-12-04T10:05:37.5836147Z =================================== FAILURES =================================== 2025-12-04T10:05:37.5836747Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5837353Z Traceback (most recent call last): 2025-12-04T10:05:37.5838126Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.5838979Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.5839812Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.5840600Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.5841440Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.5842326Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.5842852Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.5843128Z 2025-12-04T10:05:37.5843271Z Expected 0 but got 1. 
2025-12-04T10:05:37.5843606Z Absolute difference: 1 2025-12-04T10:05:37.5843955Z Relative difference: inf 2025-12-04T10:05:37.5844179Z 2025-12-04T10:05:37.5844429Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.5845334Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.5846050Z 2025-12-04T10:05:37.5846353Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.5847031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5847550Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5847975Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5849365Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5850955Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5851532Z graph_break [] 2025-12-04T10:05:37.5852009Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5852591Z Traceback (most recent call last): 2025-12-04T10:05:37.5853362Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.5854263Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.5855074Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.5855845Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.5856744Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in 
assertEqual 2025-12-04T10:05:37.5857630Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.5858142Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.5858417Z 2025-12-04T10:05:37.5858549Z Expected 0 but got 1. 2025-12-04T10:05:37.5858879Z Absolute difference: 1 2025-12-04T10:05:37.5859215Z Relative difference: inf 2025-12-04T10:05:37.5859429Z 2025-12-04T10:05:37.5859674Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.5860557Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.5861195Z 2025-12-04T10:05:37.5861477Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.5862132Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5862639Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5863055Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5864445Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5866041Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5866610Z graph_break [] 2025-12-04T10:05:37.5867031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5867532Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5867940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5868567Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5870068Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5871360Z graph_break [] 2025-12-04T10:05:37.5871821Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5872388Z Traceback (most recent call last): 2025-12-04T10:05:37.5873141Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.5873989Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.5874789Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.5875551Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.5876429Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.5877359Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.5877870Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.5878152Z 2025-12-04T10:05:37.5878278Z Expected 0 but got 1. 
2025-12-04T10:05:37.5878607Z Absolute difference: 1 2025-12-04T10:05:37.5878942Z Relative difference: inf 2025-12-04T10:05:37.5879165Z 2025-12-04T10:05:37.5879402Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.5880280Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.5880969Z 2025-12-04T10:05:37.5881256Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.5881910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5882415Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5882827Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5884175Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5885657Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5886301Z graph_break [] 2025-12-04T10:05:37.5886719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5887225Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5887643Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5888263Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5889746Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5891074Z graph_break [] 2025-12-04T10:05:37.5891491Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.5892042Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5892450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5893072Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5894557Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5895847Z graph_break [] 2025-12-04T10:05:37.5896364Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5896934Z Traceback (most recent call last): 2025-12-04T10:05:37.5897693Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.5898550Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.5899362Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.5900133Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.5900966Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.5901853Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.5902367Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.5902647Z 2025-12-04T10:05:37.5902775Z Expected 0 but got 1. 
2025-12-04T10:05:37.5903106Z Absolute difference: 1 2025-12-04T10:05:37.5903440Z Relative difference: inf 2025-12-04T10:05:37.5903658Z 2025-12-04T10:05:37.5903943Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.5904834Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.5905479Z 2025-12-04T10:05:37.5905822Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.5906525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5907032Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5907443Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5908833Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5910313Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5910881Z graph_break [] 2025-12-04T10:05:37.5911296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5911806Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5912220Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5912840Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5914322Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5915596Z graph_break [] 2025-12-04T10:05:37.5916068Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.5916575Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5916984Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5917609Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5919136Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5920445Z graph_break [] 2025-12-04T10:05:37.5920855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5921355Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5921765Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5922381Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5923871Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5925153Z graph_break [] 2025-12-04T10:05:37.5925611Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5926229Z Traceback (most recent call last): 2025-12-04T10:05:37.5926984Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.5927829Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.5928632Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.5929397Z return super().assertEqual(x, y, *args, **kwargs) 
2025-12-04T10:05:37.5930223Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.5931088Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.5931640Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.5931910Z 2025-12-04T10:05:37.5932047Z Expected 0 but got 1. 2025-12-04T10:05:37.5932371Z Absolute difference: 1 2025-12-04T10:05:37.5932711Z Relative difference: inf 2025-12-04T10:05:37.5932924Z 2025-12-04T10:05:37.5933166Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.5934049Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.5934692Z 2025-12-04T10:05:37.5934979Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.5935659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5936219Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5936623Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5937968Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5939449Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5940011Z graph_break [] 2025-12-04T10:05:37.5940416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5940914Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5941321Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5941943Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T10:05:37.5943433Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5944706Z graph_break [] 2025-12-04T10:05:37.5945154Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5945661Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5946184Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5946802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5948287Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5949563Z graph_break [] 2025-12-04T10:05:37.5949974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5950475Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5950883Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5951510Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5953006Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5954290Z graph_break [] 2025-12-04T10:05:37.5954699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5955198Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5955606Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.5956283Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5957758Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5959069Z graph_break [] 2025-12-04T10:05:37.5959529Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5960092Z Traceback (most recent call last): 2025-12-04T10:05:37.5960845Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.5961689Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.5962491Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.5963308Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.5964145Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.5965016Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.5965519Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.5965799Z 2025-12-04T10:05:37.5965977Z Expected 0 but got 1. 
2025-12-04T10:05:37.5966306Z Absolute difference: 1 2025-12-04T10:05:37.5966638Z Relative difference: inf 2025-12-04T10:05:37.5966859Z 2025-12-04T10:05:37.5967095Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.5967970Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.5968615Z 2025-12-04T10:05:37.5968895Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.5969545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5970051Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5970461Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5971834Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5973310Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5973923Z graph_break [] 2025-12-04T10:05:37.5974337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5974835Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5975244Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5975869Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5977399Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5978665Z graph_break [] 2025-12-04T10:05:37.5979083Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.5979582Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5979994Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5980611Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5982089Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5983361Z graph_break [] 2025-12-04T10:05:37.5983772Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5984272Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5984678Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5985298Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5986875Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5988138Z graph_break [] 2025-12-04T10:05:37.5988547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5989045Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5989448Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5990111Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5991582Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5992853Z graph_break [] 2025-12-04T10:05:37.5993265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.5993770Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.5994176Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.5994794Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.5996445Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.5997830Z graph_break [] 2025-12-04T10:05:37.5998293Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.5998861Z Traceback (most recent call last): 2025-12-04T10:05:37.5999660Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6000513Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6001354Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6002121Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6002947Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6003819Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6004320Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6004589Z 2025-12-04T10:05:37.6004721Z Expected 0 but got 1. 
2025-12-04T10:05:37.6005044Z Absolute difference: 1 2025-12-04T10:05:37.6005376Z Relative difference: inf 2025-12-04T10:05:37.6005591Z 2025-12-04T10:05:37.6005830Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6006774Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6007419Z 2025-12-04T10:05:37.6007710Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6008366Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6008865Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6009273Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6010618Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6012092Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6012660Z graph_break [] 2025-12-04T10:05:37.6013118Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6013616Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6014030Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6014647Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6016212Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6017483Z graph_break [] 2025-12-04T10:05:37.6017892Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6018391Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6018794Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6019411Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6020901Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6022177Z graph_break [] 2025-12-04T10:05:37.6022584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6023081Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6023487Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6024104Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6025578Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6026907Z graph_break [] 2025-12-04T10:05:37.6027358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6027900Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6028303Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6028919Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6030402Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6031680Z graph_break [] 2025-12-04T10:05:37.6032086Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6032588Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6032990Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6033617Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6035201Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6070592Z graph_break [] 2025-12-04T10:05:37.6071054Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6071573Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6071995Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6072629Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6074234Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6075728Z graph_break [] 2025-12-04T10:05:37.6076270Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6076842Z Traceback (most recent call last): 2025-12-04T10:05:37.6077622Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6078483Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 
2025-12-04T10:05:37.6079353Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6080132Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6080965Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6081841Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6082353Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6082639Z 2025-12-04T10:05:37.6082769Z Expected 0 but got 1. 2025-12-04T10:05:37.6083103Z Absolute difference: 1 2025-12-04T10:05:37.6083438Z Relative difference: inf 2025-12-04T10:05:37.6083662Z 2025-12-04T10:05:37.6083904Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6084794Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6085439Z 2025-12-04T10:05:37.6085735Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6086448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6086961Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6087374Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6088795Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6090323Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6090889Z graph_break [] 2025-12-04T10:05:37.6091302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6091799Z frames [('total', 1), 
('ok', 1)] 2025-12-04T10:05:37.6092210Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6092826Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6094290Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6095566Z graph_break [] 2025-12-04T10:05:37.6096058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6096565Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6096969Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6097583Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6099066Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6100335Z graph_break [] 2025-12-04T10:05:37.6100743Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6101244Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6101637Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6102024Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6102939Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6103728Z graph_break [] 2025-12-04T10:05:37.6103985Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6104322Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6104579Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6104963Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6105887Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6106726Z graph_break [] 2025-12-04T10:05:37.6106983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6107295Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6107547Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6107935Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6108934Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6109789Z graph_break [] 2025-12-04T10:05:37.6110043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6110357Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6110641Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6111054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6112033Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6112883Z graph_break [] 2025-12-04T10:05:37.6113142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6113458Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6113713Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6114099Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6115100Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6116009Z graph_break [] 2025-12-04T10:05:37.6116297Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6116652Z Traceback (most recent call last): 2025-12-04T10:05:37.6117128Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6117657Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6118161Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6118644Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6119201Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6119751Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6120066Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6120235Z 2025-12-04T10:05:37.6120318Z Expected 0 but got 1. 
2025-12-04T10:05:37.6120521Z Absolute difference: 1 2025-12-04T10:05:37.6120728Z Relative difference: inf 2025-12-04T10:05:37.6120861Z 2025-12-04T10:05:37.6121012Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6121586Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6121989Z 2025-12-04T10:05:37.6122172Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6122576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6122891Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6123147Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6123981Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6124901Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6125250Z graph_break [] 2025-12-04T10:05:37.6125507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6125819Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6126119Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6126501Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6127444Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6128261Z graph_break [] 2025-12-04T10:05:37.6128517Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6128827Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6129077Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6129462Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6130396Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6131191Z graph_break [] 2025-12-04T10:05:37.6131449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6131763Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6132016Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6132406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6133321Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6134113Z graph_break [] 2025-12-04T10:05:37.6134365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6134677Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6134929Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6135310Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6136299Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6137089Z graph_break [] 2025-12-04T10:05:37.6137345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6137654Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6137906Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6138322Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6139316Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6140170Z graph_break [] 2025-12-04T10:05:37.6140431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6140743Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6140995Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6141378Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6142360Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6143212Z graph_break [] 2025-12-04T10:05:37.6143468Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6143781Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6144032Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6144443Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6145446Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6146359Z graph_break [] 2025-12-04T10:05:37.6146613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6146925Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6147179Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6147565Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6148554Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6149424Z graph_break [] 2025-12-04T10:05:37.6149715Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6150076Z Traceback (most recent call last): 2025-12-04T10:05:37.6150544Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6151076Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6151580Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6152063Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6152587Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6153167Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6153484Z AssertionError: Scalars are not 
equal! 2025-12-04T10:05:37.6153658Z 2025-12-04T10:05:37.6153736Z Expected 0 but got 1. 2025-12-04T10:05:37.6153938Z Absolute difference: 1 2025-12-04T10:05:37.6154143Z Relative difference: inf 2025-12-04T10:05:37.6154280Z 2025-12-04T10:05:37.6154428Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6154978Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6155418Z 2025-12-04T10:05:37.6155596Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6156050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6156362Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6156616Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6157457Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6158379Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6158729Z graph_break [] 2025-12-04T10:05:37.6158985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6159299Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6159554Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6159937Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6160855Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T10:05:37.6161675Z graph_break [] 2025-12-04T10:05:37.6161932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6162273Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6162526Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6162909Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6163832Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6164630Z graph_break [] 2025-12-04T10:05:37.6164891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6165202Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6165458Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6165850Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6166843Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6167635Z graph_break [] 2025-12-04T10:05:37.6167891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6168214Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6168472Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6168860Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6169794Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6170617Z graph_break [] 2025-12-04T10:05:37.6170877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6171194Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6171450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6171835Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6172857Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6173720Z graph_break [] 2025-12-04T10:05:37.6173981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6174297Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6174554Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6174946Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6175975Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6176836Z graph_break [] 2025-12-04T10:05:37.6177097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6177414Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6177670Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6178060Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6179085Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6179970Z graph_break [] 2025-12-04T10:05:37.6180229Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6180546Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6180803Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6181191Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6182175Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6183037Z graph_break [] 2025-12-04T10:05:37.6183296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6183610Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6183862Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6184249Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6185234Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6186137Z graph_break [] 2025-12-04T10:05:37.6186426Z ______________ 
MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6186780Z Traceback (most recent call last): 2025-12-04T10:05:37.6187288Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6187814Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6188319Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6188802Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6189324Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6189902Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6190218Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6190391Z 2025-12-04T10:05:37.6190472Z Expected 0 but got 1. 2025-12-04T10:05:37.6190677Z Absolute difference: 1 2025-12-04T10:05:37.6190884Z Relative difference: inf 2025-12-04T10:05:37.6191019Z 2025-12-04T10:05:37.6191170Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6191719Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6192120Z 2025-12-04T10:05:37.6192301Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6192709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6193023Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6193283Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6194120Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6195033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6195386Z graph_break [] 2025-12-04T10:05:37.6195669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6196024Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6196304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6196691Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6197618Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6198408Z graph_break [] 2025-12-04T10:05:37.6198667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6198979Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6199228Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6199618Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6200538Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6201332Z graph_break [] 2025-12-04T10:05:37.6201584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6201895Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6202150Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6202536Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6203459Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6204284Z graph_break [] 2025-12-04T10:05:37.6204540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6204857Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6205110Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6205497Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6206479Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6207274Z graph_break [] 2025-12-04T10:05:37.6207531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6207847Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6208102Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6208490Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6209477Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6210324Z graph_break [] 2025-12-04T10:05:37.6210579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6210892Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6211145Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6211529Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6212537Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6213418Z graph_break [] 2025-12-04T10:05:37.6213674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6213983Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6214235Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6214622Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6215603Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6216502Z graph_break [] 2025-12-04T10:05:37.6216760Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6217077Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6217332Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6217717Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6218697Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6219552Z 
graph_break [] 2025-12-04T10:05:37.6219809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6220121Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6220372Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6220755Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6221778Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6222635Z graph_break [] 2025-12-04T10:05:37.6222892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6223202Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6223480Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6223863Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6224840Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6225698Z graph_break [] 2025-12-04T10:05:37.6226027Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6226387Z Traceback (most recent call last): 2025-12-04T10:05:37.6226858Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6227385Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6227886Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6228364Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6228887Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6229440Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6229757Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6229928Z 2025-12-04T10:05:37.6230034Z Expected 0 but got 1. 2025-12-04T10:05:37.6230269Z Absolute difference: 1 2025-12-04T10:05:37.6230477Z Relative difference: inf 2025-12-04T10:05:37.6230617Z 2025-12-04T10:05:37.6230766Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6231317Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6231722Z 2025-12-04T10:05:37.6231900Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6232308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6232625Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6232883Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6233723Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6234648Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6235003Z graph_break [] 2025-12-04T10:05:37.6235261Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6235574Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6235827Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6236264Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6237196Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6238015Z graph_break [] 2025-12-04T10:05:37.6238276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6238589Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6238847Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6239234Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6240182Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6240974Z graph_break [] 2025-12-04T10:05:37.6241232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6241545Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6241800Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6242195Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6243112Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6243905Z graph_break [] 2025-12-04T10:05:37.6244161Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6244474Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6244733Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6245119Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6246096Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6246921Z graph_break [] 2025-12-04T10:05:37.6247180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6247524Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6247780Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6248165Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6249150Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6250006Z graph_break [] 2025-12-04T10:05:37.6250265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6250575Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6250830Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6251215Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6252197Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6253057Z graph_break [] 2025-12-04T10:05:37.6253315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6253630Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6253883Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6254267Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6255249Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6256178Z graph_break [] 2025-12-04T10:05:37.6256437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6256751Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6257004Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6257420Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6258401Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6259251Z graph_break [] 2025-12-04T10:05:37.6259513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6259828Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6260084Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6260470Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6261450Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6262305Z graph_break [] 2025-12-04T10:05:37.6262565Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6262880Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6263131Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6263519Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6264529Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6265406Z graph_break [] 2025-12-04T10:05:37.6265662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6266013Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6266271Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6266657Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6267637Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6268497Z graph_break [] 2025-12-04T10:05:37.6268791Z ______________ 
MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6269147Z Traceback (most recent call last): 2025-12-04T10:05:37.6269617Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6270148Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6270659Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6271144Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6271751Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6272589Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6272939Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6273117Z 2025-12-04T10:05:37.6273227Z Expected 0 but got 1. 2025-12-04T10:05:37.6273500Z Absolute difference: 1 2025-12-04T10:05:37.6273715Z Relative difference: inf 2025-12-04T10:05:37.6273852Z 2025-12-04T10:05:37.6274026Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6274666Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6275072Z 2025-12-04T10:05:37.6275281Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6275692Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6276061Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6276317Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6277170Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6278098Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6278450Z graph_break [] 2025-12-04T10:05:37.6278705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6279021Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6279279Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6279668Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6280596Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6281391Z graph_break [] 2025-12-04T10:05:37.6281678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6282038Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6282299Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6282687Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6283620Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6284418Z graph_break [] 2025-12-04T10:05:37.6284562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6284720Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6284851Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6285046Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6285494Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6285879Z graph_break [] 2025-12-04T10:05:37.6286046Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6286202Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6286329Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6286519Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6286967Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6287370Z graph_break [] 2025-12-04T10:05:37.6287503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6287660Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6287788Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6287979Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6288473Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6288892Z graph_break [] 2025-12-04T10:05:37.6289021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6289180Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6289309Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6289500Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6289988Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6290407Z graph_break [] 2025-12-04T10:05:37.6290537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6290693Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6290820Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6291012Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6291503Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6291937Z graph_break [] 2025-12-04T10:05:37.6292065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6292223Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6292352Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6292543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6293021Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6293440Z 
graph_break [] 2025-12-04T10:05:37.6293571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6293729Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6293858Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6294053Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6294537Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6294954Z graph_break [] 2025-12-04T10:05:37.6295082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6295239Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6295367Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6295575Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6296098Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6296514Z graph_break [] 2025-12-04T10:05:37.6296644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6296815Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6296943Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6297136Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6297611Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6298025Z graph_break [] 2025-12-04T10:05:37.6298152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6298306Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6298432Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6298624Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6299111Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6299527Z graph_break [] 2025-12-04T10:05:37.6299674Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6299866Z Traceback (most recent call last): 2025-12-04T10:05:37.6300099Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6300371Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6300617Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6300853Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6301111Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6301378Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6301536Z AssertionError: Scalars are not equal! 
2025-12-04T10:05:37.6301623Z 2025-12-04T10:05:37.6301664Z Expected 0 but got 1. 2025-12-04T10:05:37.6301768Z Absolute difference: 1 2025-12-04T10:05:37.6301875Z Relative difference: inf 2025-12-04T10:05:37.6301942Z 2025-12-04T10:05:37.6302020Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6302289Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6302483Z 2025-12-04T10:05:37.6302573Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6302775Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6302933Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6303063Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6303473Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6303933Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6304110Z graph_break [] 2025-12-04T10:05:37.6304241Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6304399Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6304528Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6304719Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6305181Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6305569Z 
graph_break [] 2025-12-04T10:05:37.6305695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6305847Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6306007Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6306196Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6306649Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6307030Z graph_break [] 2025-12-04T10:05:37.6307159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6307313Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6307440Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6307633Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6308089Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6308487Z graph_break [] 2025-12-04T10:05:37.6308616Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6308769Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6308893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6309085Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6309536Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6309917Z graph_break [] 2025-12-04T10:05:37.6310043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6310200Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6310325Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6310516Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6310993Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6311410Z graph_break [] 2025-12-04T10:05:37.6311540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6311693Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6311817Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6312007Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6312500Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6312916Z graph_break [] 2025-12-04T10:05:37.6313042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6313195Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6313322Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6313526Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6314009Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6314424Z graph_break [] 2025-12-04T10:05:37.6314553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6314707Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6314831Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6315018Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6315497Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6315911Z graph_break [] 2025-12-04T10:05:37.6316082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6316236Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6316364Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6316573Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6317065Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6317481Z graph_break [] 2025-12-04T10:05:37.6317608Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6317762Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6317887Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6318077Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6318556Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6318977Z graph_break [] 2025-12-04T10:05:37.6319104Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6319260Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6319391Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6319586Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6320064Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6320480Z graph_break [] 2025-12-04T10:05:37.6320628Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6320789Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6320918Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6321105Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6321597Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6322015Z graph_break [] 2025-12-04T10:05:37.6322144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6322296Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6322419Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6322608Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6323090Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6323506Z graph_break [] 2025-12-04T10:05:37.6323651Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6323823Z Traceback (most recent call last): 2025-12-04T10:05:37.6324055Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6324311Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6324555Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6324788Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6325057Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6325335Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6325489Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6325575Z 2025-12-04T10:05:37.6325616Z Expected 0 but got 1. 
2025-12-04T10:05:37.6325723Z Absolute difference: 1 2025-12-04T10:05:37.6325831Z Relative difference: inf 2025-12-04T10:05:37.6325902Z 2025-12-04T10:05:37.6326023Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6326293Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6326490Z 2025-12-04T10:05:37.6326579Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6326779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6326939Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6327071Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6327485Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6327931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6328109Z graph_break [] 2025-12-04T10:05:37.6328242Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6328401Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6328533Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6328730Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6329198Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6329582Z graph_break [] 2025-12-04T10:05:37.6329711Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6329866Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6329993Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6330200Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6330646Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6331029Z graph_break [] 2025-12-04T10:05:37.6331161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6331318Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6331447Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6331639Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6332086Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6332469Z graph_break [] 2025-12-04T10:05:37.6332597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6332755Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6332880Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6333068Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6333532Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6333931Z graph_break [] 2025-12-04T10:05:37.6334059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6334213Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6334341Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6334532Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6335010Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6335433Z graph_break [] 2025-12-04T10:05:37.6335562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6335719Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6335843Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6336074Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6336555Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6336966Z graph_break [] 2025-12-04T10:05:37.6337096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6337252Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6337395Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6337586Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6338069Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6338488Z graph_break [] 2025-12-04T10:05:37.6338634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6338791Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6338915Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6339104Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6339580Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6339998Z graph_break [] 2025-12-04T10:05:37.6340124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6340279Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6340402Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6340593Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6341068Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6341483Z graph_break [] 2025-12-04T10:05:37.6341628Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6341787Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6341926Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6342118Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6342596Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6343014Z graph_break [] 2025-12-04T10:05:37.6343143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6343299Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6343425Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6343616Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6344097Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6344512Z graph_break [] 2025-12-04T10:05:37.6344640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6344793Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6344921Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6345115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6345594Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6346056Z graph_break [] 2025-12-04T10:05:37.6346184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6346342Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6346468Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6346656Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6347155Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6347569Z graph_break [] 2025-12-04T10:05:37.6347698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6347857Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6347986Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6348175Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6348654Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6349066Z graph_break [] 2025-12-04T10:05:37.6349211Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6349384Z Traceback (most recent call last): 2025-12-04T10:05:37.6352669Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6352940Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6353218Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6353470Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6353721Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6353985Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6354136Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6354218Z 2025-12-04T10:05:37.6354260Z Expected 0 but got 1. 2025-12-04T10:05:37.6354364Z Absolute difference: 1 2025-12-04T10:05:37.6354468Z Relative difference: inf 2025-12-04T10:05:37.6354532Z 2025-12-04T10:05:37.6354605Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6354867Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6355061Z 2025-12-04T10:05:37.6355151Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6355348Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6355502Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6355557Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6355875Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6356019Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6356058Z graph_break [] 2025-12-04T10:05:37.6356130Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6356171Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6356240Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6356339Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6356651Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6356688Z graph_break [] 2025-12-04T10:05:37.6356759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6356814Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6356867Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6356965Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6357274Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6357315Z graph_break [] 2025-12-04T10:05:37.6357386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6357427Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6357480Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6357575Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6357885Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6357920Z graph_break [] 2025-12-04T10:05:37.6357995Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6358035Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6358102Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6358216Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6358525Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6358561Z graph_break [] 2025-12-04T10:05:37.6358634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6358673Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6358727Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6358822Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6359170Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6359207Z graph_break [] 2025-12-04T10:05:37.6359278Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6359317Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6359372Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6359467Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6359806Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6359853Z graph_break [] 2025-12-04T10:05:37.6359929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6359969Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6360023Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6360118Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6360485Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6360523Z graph_break [] 2025-12-04T10:05:37.6360594Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6360634Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6360686Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6360788Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6361130Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6361167Z graph_break [] 2025-12-04T10:05:37.6361237Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6361279Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6361331Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6361426Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6361774Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6361822Z graph_break [] 2025-12-04T10:05:37.6361893Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6361933Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6361984Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6362080Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6362419Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6362456Z graph_break [] 2025-12-04T10:05:37.6362528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6362571Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6362623Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6362720Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6363059Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6363097Z graph_break [] 2025-12-04T10:05:37.6363170Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6363208Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6363261Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6363355Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6363706Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6363743Z graph_break [] 2025-12-04T10:05:37.6363814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6363853Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6363905Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6364010Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6364352Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6364390Z graph_break [] 2025-12-04T10:05:37.6364463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6364502Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6364556Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6364650Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6364990Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6365026Z graph_break [] 2025-12-04T10:05:37.6365099Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6365139Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6365193Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6365297Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6365651Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6365687Z graph_break [] 2025-12-04T10:05:37.6365776Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6365823Z Traceback (most recent call last): 2025-12-04T10:05:37.6366017Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6366089Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6366229Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6366291Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6366450Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6366522Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6366569Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6366571Z 2025-12-04T10:05:37.6366613Z Expected 0 but got 1. 
2025-12-04T10:05:37.6366651Z Absolute difference: 1 2025-12-04T10:05:37.6366694Z Relative difference: inf 2025-12-04T10:05:37.6366696Z 2025-12-04T10:05:37.6366767Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6366924Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6366926Z 2025-12-04T10:05:37.6367012Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6367113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6367156Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6367212Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6367525Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6367639Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6367675Z graph_break [] 2025-12-04T10:05:37.6367748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6367788Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6367842Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6367939Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6368253Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6368290Z graph_break [] 2025-12-04T10:05:37.6368362Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6368402Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6368457Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6368552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6368875Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6368914Z graph_break [] 2025-12-04T10:05:37.6368985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6369039Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6369091Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6369186Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6369495Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6369532Z graph_break [] 2025-12-04T10:05:37.6369604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6369645Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6369698Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6369794Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6370101Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6370138Z graph_break [] 2025-12-04T10:05:37.6370209Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6370251Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6370304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6370401Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6370740Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6370792Z graph_break [] 2025-12-04T10:05:37.6370863Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6370903Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6370955Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6371050Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6371400Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6371436Z graph_break [] 2025-12-04T10:05:37.6371512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6371553Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6371607Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6371703Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6372044Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6372079Z graph_break [] 2025-12-04T10:05:37.6372152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6372192Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6372246Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6372340Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6372693Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6372741Z graph_break [] 2025-12-04T10:05:37.6372814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6372852Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6372907Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6373002Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6373342Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6373379Z graph_break [] 2025-12-04T10:05:37.6373453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6373494Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6373549Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6373645Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6373988Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6374023Z graph_break [] 2025-12-04T10:05:37.6374095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6374147Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6374204Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6374299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6374640Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6374679Z graph_break [] 2025-12-04T10:05:37.6374764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6374805Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6374856Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6374953Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6375294Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6375333Z graph_break [] 2025-12-04T10:05:37.6375404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6375444Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6375496Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6375592Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6375981Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6376018Z graph_break [] 2025-12-04T10:05:37.6376103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6376145Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6376211Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6376307Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6376645Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6376681Z graph_break [] 2025-12-04T10:05:37.6376753Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6376795Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6376848Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6376946Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6377287Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6377323Z graph_break [] 2025-12-04T10:05:37.6377396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6377437Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6377491Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6377588Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6377940Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6377993Z graph_break [] 2025-12-04T10:05:37.6378083Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6378127Z Traceback (most recent call last): 2025-12-04T10:05:37.6378279Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6378349Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6378501Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6378561Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6378719Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6378793Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6378843Z AssertionError: Scalars are not equal! 
2025-12-04T10:05:37.6378845Z 2025-12-04T10:05:37.6378885Z Expected 0 but got 1. 2025-12-04T10:05:37.6378924Z Absolute difference: 1 2025-12-04T10:05:37.6378962Z Relative difference: inf 2025-12-04T10:05:37.6378964Z 2025-12-04T10:05:37.6379037Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6379193Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6379195Z 2025-12-04T10:05:37.6379282Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6379354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6379396Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6379450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6379777Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6379893Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6379929Z graph_break [] 2025-12-04T10:05:37.6380001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6380042Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6380097Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6380191Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6380503Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6380540Z 
graph_break [] 2025-12-04T10:05:37.6380612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6380652Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6380706Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6380801Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6381112Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6381147Z graph_break [] 2025-12-04T10:05:37.6381219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6381259Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6381312Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6381418Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6381728Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6381764Z graph_break [] 2025-12-04T10:05:37.6381836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6381886Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6381940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6382034Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6382348Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6382385Z graph_break [] 2025-12-04T10:05:37.6382457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6382497Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6382550Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6382644Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6382985Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6383022Z graph_break [] 2025-12-04T10:05:37.6383093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6383134Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6383198Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6383294Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6383640Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6383676Z graph_break [] 2025-12-04T10:05:37.6383749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6383789Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6383841Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6383937Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6384277Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6384314Z graph_break [] 2025-12-04T10:05:37.6384383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6384423Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6384475Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6384571Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6384910Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6384958Z graph_break [] 2025-12-04T10:05:37.6385030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6385072Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6385124Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6385218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6385564Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6385600Z graph_break [] 2025-12-04T10:05:37.6385670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6385708Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6385762Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6385859Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6386238Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6386274Z graph_break [] 2025-12-04T10:05:37.6386346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6386386Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6386440Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6386535Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6386891Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6386943Z graph_break [] 2025-12-04T10:05:37.6387015Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6387054Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6387107Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6387200Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6387544Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6387580Z graph_break [] 2025-12-04T10:05:37.6387654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6387695Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6387749Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6387844Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6388187Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6388223Z graph_break [] 2025-12-04T10:05:37.6388296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6388335Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6388389Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6388482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6388834Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6388870Z graph_break [] 2025-12-04T10:05:37.6388942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6388982Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6389050Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6389144Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6389482Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6389520Z graph_break [] 2025-12-04T10:05:37.6389590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6389633Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6389685Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6389780Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6390123Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6390159Z graph_break [] 2025-12-04T10:05:37.6390229Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6390269Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6390323Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6390430Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6390782Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6390818Z graph_break [] 2025-12-04T10:05:37.6390906Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6390951Z Traceback (most recent call last): 
2025-12-04T10:05:37.6391101Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6391171Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6391311Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6391371Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6391529Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6391601Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6391650Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6391652Z 2025-12-04T10:05:37.6391693Z Expected 0 but got 1. 2025-12-04T10:05:37.6391731Z Absolute difference: 1 2025-12-04T10:05:37.6391773Z Relative difference: inf 2025-12-04T10:05:37.6391775Z 2025-12-04T10:05:37.6391846Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6392002Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6392004Z 2025-12-04T10:05:37.6392102Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6392177Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6392219Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6392272Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6392590Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6392695Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6392733Z graph_break [] 2025-12-04T10:05:37.6392803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6392844Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6392897Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6392993Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6393303Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6393339Z graph_break [] 2025-12-04T10:05:37.6393410Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6393452Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6393505Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6393600Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6393924Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6393962Z graph_break [] 2025-12-04T10:05:37.6394045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6394085Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6394138Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6394235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6394544Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 
2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6394581Z graph_break [] 2025-12-04T10:05:37.6394652Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6394693Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6394748Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6394844Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6395153Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6395188Z graph_break [] 2025-12-04T10:05:37.6395260Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6395300Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6395353Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6395447Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6395786Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6395833Z graph_break [] 2025-12-04T10:05:37.6395905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6395991Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6396044Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6396153Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6396494Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6396530Z graph_break [] 2025-12-04T10:05:37.6396605Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6396645Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6396700Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6396793Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6397132Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6397168Z graph_break [] 2025-12-04T10:05:37.6397240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6397279Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6397335Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6397429Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6397786Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6397838Z graph_break [] 2025-12-04T10:05:37.6397908Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:05:37.6397949Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6398003Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6398100Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6398443Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6398481Z graph_break [] 2025-12-04T10:05:37.6398553Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6398594Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6398646Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6398740Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6399080Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6399116Z graph_break [] 2025-12-04T10:05:37.6399186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6399243Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6399296Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6399391Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6399729Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6399765Z graph_break [] 2025-12-04T10:05:37.6399846Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6399888Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6399939Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6400035Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6400382Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6400419Z graph_break [] 2025-12-04T10:05:37.6400491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6400530Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6400583Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6400678Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6401017Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6401054Z graph_break [] 2025-12-04T10:05:37.6401137Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6401187Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6401240Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6401333Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T10:05:37.6401674Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6401709Z graph_break [] 2025-12-04T10:05:37.6401781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6401821Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6401873Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6401969Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6402308Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6402344Z graph_break [] 2025-12-04T10:05:37.6402417Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6402458Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6402510Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6402605Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6402945Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6402995Z graph_break [] 2025-12-04T10:05:37.6403067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6403107Z frames [('total', 
1), ('ok', 1)] 2025-12-04T10:05:37.6403160Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6403255Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6403603Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6403639Z graph_break [] 2025-12-04T10:05:37.6403710Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6403754Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6403806Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6403903Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6404239Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6404277Z graph_break [] 2025-12-04T10:05:37.6404365Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6404412Z Traceback (most recent call last): 2025-12-04T10:05:37.6404565Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6404637Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6404786Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6404858Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6405015Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6405086Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6405134Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6405136Z 2025-12-04T10:05:37.6405178Z Expected 0 but got 1. 2025-12-04T10:05:37.6405216Z Absolute difference: 1 2025-12-04T10:05:37.6405257Z Relative difference: inf 2025-12-04T10:05:37.6405259Z 2025-12-04T10:05:37.6405329Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6405486Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6405489Z 2025-12-04T10:05:37.6405576Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6405650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6405691Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6405745Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6406106Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6406201Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6406237Z graph_break [] 2025-12-04T10:05:37.6406308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6406363Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6406417Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6406512Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6406823Z inductor 
[('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6406860Z graph_break [] 2025-12-04T10:05:37.6406951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6406991Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6407043Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6407139Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6407448Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6407486Z graph_break [] 2025-12-04T10:05:37.6407556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6407596Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6407648Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6407744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6408058Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6408094Z graph_break [] 2025-12-04T10:05:37.6408165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6408220Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6408273Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6408382Z 
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6408691Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6408729Z graph_break [] 2025-12-04T10:05:37.6408801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6408840Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6408893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6408986Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6409330Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6409366Z graph_break [] 2025-12-04T10:05:37.6409438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6409476Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6409532Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6409626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6409965Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6410011Z graph_break [] 2025-12-04T10:05:37.6410084Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6410124Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6410178Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6410271Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6410622Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6410659Z graph_break [] 2025-12-04T10:05:37.6410733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6410772Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6410826Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6410920Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6411259Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6411294Z graph_break [] 2025-12-04T10:05:37.6411367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6411406Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6411459Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6411554Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6411903Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6411950Z graph_break [] 2025-12-04T10:05:37.6412021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6412061Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6412113Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6412207Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6412546Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6412582Z graph_break [] 2025-12-04T10:05:37.6412653Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6412694Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6412747Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6412842Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6413181Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6413218Z graph_break [] 2025-12-04T10:05:37.6413289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6413329Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6413383Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6413478Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6413828Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6413864Z graph_break [] 2025-12-04T10:05:37.6413935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6413975Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6414039Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6414135Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6414476Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6414513Z graph_break [] 2025-12-04T10:05:37.6414586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6414625Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6414678Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6414771Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6415113Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6415148Z graph_break [] 2025-12-04T10:05:37.6415220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6415260Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6415323Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6415417Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6415765Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6415801Z graph_break [] 2025-12-04T10:05:37.6415874Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6415913Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6416012Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6416106Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6416447Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6416483Z graph_break [] 2025-12-04T10:05:37.6416556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6416595Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6416648Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6416743Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6417087Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6417138Z graph_break [] 2025-12-04T10:05:37.6417210Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6417250Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6417303Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6417397Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6417751Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6417788Z graph_break [] 2025-12-04T10:05:37.6417858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6417898Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6417951Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6418048Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6418385Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6418423Z graph_break [] 2025-12-04T10:05:37.6418510Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6418556Z Traceback (most recent call last): 2025-12-04T10:05:37.6418707Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6418777Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 
2025-12-04T10:05:37.6418916Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6418997Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6419155Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6419243Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6419291Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6419293Z 2025-12-04T10:05:37.6419332Z Expected 0 but got 1. 2025-12-04T10:05:37.6419370Z Absolute difference: 1 2025-12-04T10:05:37.6419411Z Relative difference: inf 2025-12-04T10:05:37.6419413Z 2025-12-04T10:05:37.6419485Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6419642Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6419644Z 2025-12-04T10:05:37.6419729Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6419804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6419845Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6419900Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6420213Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6420309Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6420345Z graph_break [] 2025-12-04T10:05:37.6420416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6420457Z frames [('total', 1), 
('ok', 1)] 2025-12-04T10:05:37.6420510Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6420606Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6420932Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6420969Z graph_break [] 2025-12-04T10:05:37.6421040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6421081Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6421133Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6421237Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6421545Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6421582Z graph_break [] 2025-12-04T10:05:37.6421654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6421696Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6421749Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6421845Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6422155Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6422191Z graph_break [] 2025-12-04T10:05:37.6422261Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6422302Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6422354Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6422462Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6422773Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6422819Z graph_break [] 2025-12-04T10:05:37.6422893Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6422932Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6422986Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6423080Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6423425Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6423461Z graph_break [] 2025-12-04T10:05:37.6423534Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6423574Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6423630Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6423725Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6424068Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6424103Z graph_break [] 2025-12-04T10:05:37.6424175Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6424226Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6424281Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6424376Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6424719Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6424755Z graph_break [] 2025-12-04T10:05:37.6424838Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6424878Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6424933Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6425027Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6425372Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6425413Z graph_break [] 2025-12-04T10:05:37.6425485Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6425527Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6425579Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6425675Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6426063Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6426101Z graph_break [] 2025-12-04T10:05:37.6426191Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6426246Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6426300Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6426397Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6426738Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6426775Z graph_break [] 2025-12-04T10:05:37.6426846Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6426886Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6426939Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6427037Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6427380Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6427419Z graph_break [] 2025-12-04T10:05:37.6427491Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6427533Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6427585Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6427681Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6428021Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6428072Z graph_break [] 2025-12-04T10:05:37.6428144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6428184Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6428239Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6428334Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6428687Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6428723Z graph_break [] 2025-12-04T10:05:37.6428797Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6428839Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6428894Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6428993Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6429343Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6429378Z graph_break [] 2025-12-04T10:05:37.6429451Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6429490Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6429546Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6429640Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6429991Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6430037Z graph_break [] 2025-12-04T10:05:37.6430110Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6430150Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6430205Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6430299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6430645Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6430684Z graph_break [] 2025-12-04T10:05:37.6430755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6430797Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6430850Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6430946Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6431287Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6431324Z graph_break [] 2025-12-04T10:05:37.6431395Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6431437Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6431503Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6431598Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6431937Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6431974Z graph_break [] 2025-12-04T10:05:37.6432055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6432097Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6432150Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6432246Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6432584Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6432624Z graph_break [] 2025-12-04T10:05:37.6432695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6432737Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6432789Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6432887Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6433227Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6433266Z graph_break [] 2025-12-04T10:05:37.6433368Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6433412Z Traceback (most recent call last): 2025-12-04T10:05:37.6433582Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6433652Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6433791Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6433851Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6434009Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6434079Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6434129Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6434131Z 2025-12-04T10:05:37.6434171Z Expected 0 but got 1. 
2025-12-04T10:05:37.6434211Z Absolute difference: 1 2025-12-04T10:05:37.6434252Z Relative difference: inf 2025-12-04T10:05:37.6434254Z 2025-12-04T10:05:37.6434327Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6434484Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6434486Z 2025-12-04T10:05:37.6434575Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6434647Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6434691Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6434744Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6435058Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6435168Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6435207Z graph_break [] 2025-12-04T10:05:37.6435279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6435321Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6435373Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6435468Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6435787Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6435825Z graph_break [] 2025-12-04T10:05:37.6435896Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6435986Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6436041Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6436137Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6436446Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6436482Z graph_break [] 2025-12-04T10:05:37.6436554Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6436594Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6436648Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6436742Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6437067Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6437117Z graph_break [] 2025-12-04T10:05:37.6437194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6437233Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6437287Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6437383Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6437696Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6437732Z graph_break [] 2025-12-04T10:05:37.6437807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6437846Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6437901Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6437996Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6438337Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6438374Z graph_break [] 2025-12-04T10:05:37.6438447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6438487Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6438541Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6438653Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6438992Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6439029Z graph_break [] 2025-12-04T10:05:37.6439101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6439142Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6439209Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6439308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6439649Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6439686Z graph_break [] 2025-12-04T10:05:37.6439758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6439799Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6439851Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6439948Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6440286Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6440323Z graph_break [] 2025-12-04T10:05:37.6440394Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6440436Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6440501Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6440597Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6440945Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6440982Z graph_break [] 2025-12-04T10:05:37.6441053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6441096Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6441149Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6441246Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6441590Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6441627Z graph_break [] 2025-12-04T10:05:37.6441700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6441742Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6441795Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6441892Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6442233Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6442279Z graph_break [] 2025-12-04T10:05:37.6442352Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6442392Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6442445Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6442539Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6442889Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6442925Z graph_break [] 2025-12-04T10:05:37.6442998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6443037Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6443091Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6443187Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6443530Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6443565Z graph_break [] 2025-12-04T10:05:37.6443638Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6443681Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6443735Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6443831Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6444181Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6444228Z graph_break [] 2025-12-04T10:05:37.6444300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6444338Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6444393Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6444487Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6444829Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6444866Z graph_break [] 2025-12-04T10:05:37.6444937Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6444979Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6445033Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6445129Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6445469Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6445506Z graph_break [] 2025-12-04T10:05:37.6445578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6445618Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6445671Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6445767Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6446157Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6446194Z graph_break [] 2025-12-04T10:05:37.6446266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6446307Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6446377Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6446473Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6446811Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6446851Z graph_break [] 2025-12-04T10:05:37.6446922Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6446966Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6447018Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6447114Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6447454Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6447489Z graph_break [] 2025-12-04T10:05:37.6447561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6447600Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6447655Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6447770Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6448121Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6448157Z graph_break [] 
2025-12-04T10:05:37.6448230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6448269Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6448322Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6448416Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6448757Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6448794Z graph_break [] 2025-12-04T10:05:37.6448882Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6448928Z Traceback (most recent call last): 2025-12-04T10:05:37.6449081Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6449152Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6449292Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6449350Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6449509Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6449592Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6449642Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6449644Z 2025-12-04T10:05:37.6449683Z Expected 0 but got 1. 
2025-12-04T10:05:37.6449723Z Absolute difference: 1 2025-12-04T10:05:37.6449761Z Relative difference: inf 2025-12-04T10:05:37.6449763Z 2025-12-04T10:05:37.6449835Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6450003Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6450006Z 2025-12-04T10:05:37.6450092Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6450165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6450205Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6450264Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6450574Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6450675Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6450710Z graph_break [] 2025-12-04T10:05:37.6450781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6450822Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6450877Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6450970Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6451293Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6451341Z graph_break [] 2025-12-04T10:05:37.6451416Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6451456Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6451511Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6451605Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6451920Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6451955Z graph_break [] 2025-12-04T10:05:37.6452027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6452068Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6452122Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6452216Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6452524Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6452560Z graph_break [] 2025-12-04T10:05:37.6452632Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6452672Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6452726Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6452820Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6453134Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6453182Z graph_break [] 2025-12-04T10:05:37.6453254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6453296Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6453349Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6453445Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6453800Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6453838Z graph_break [] 2025-12-04T10:05:37.6453912Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6453954Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6454008Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6454103Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6454444Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6454480Z graph_break [] 2025-12-04T10:05:37.6454551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6454592Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6454645Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6454740Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6455091Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6455139Z graph_break [] 2025-12-04T10:05:37.6455210Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6455251Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6455304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6455401Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6455744Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6455781Z graph_break [] 2025-12-04T10:05:37.6455855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6455894Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6455988Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6456083Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6456426Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6456461Z graph_break [] 2025-12-04T10:05:37.6456533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6456589Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6456647Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6456742Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6457083Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6457118Z graph_break [] 2025-12-04T10:05:37.6457203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6457242Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6457296Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6457390Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6457735Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6457772Z graph_break [] 2025-12-04T10:05:37.6457845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6457885Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6457940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6458037Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6458373Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6458411Z graph_break [] 2025-12-04T10:05:37.6458496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6458552Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6458605Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6458701Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6459039Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6459076Z graph_break [] 2025-12-04T10:05:37.6459147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6459189Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6459243Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6459341Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6459680Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6459718Z graph_break [] 2025-12-04T10:05:37.6459790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6459832Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6459885Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6459983Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6460324Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6460375Z graph_break [] 2025-12-04T10:05:37.6460446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6460488Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6460542Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6460638Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6461002Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6461040Z graph_break [] 2025-12-04T10:05:37.6461113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6461155Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6461212Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6461308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6461646Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6461682Z graph_break [] 2025-12-04T10:05:37.6461755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6461795Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6461853Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6461948Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6462303Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6462350Z graph_break [] 2025-12-04T10:05:37.6462424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6462464Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6462518Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6462614Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6462954Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6462992Z graph_break [] 2025-12-04T10:05:37.6463065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6463105Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6463159Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6463253Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6463591Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6463629Z graph_break [] 
2025-12-04T10:05:37.6463701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6463743Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6463807Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6463904Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6464245Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6464282Z graph_break [] 2025-12-04T10:05:37.6464365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6464407Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6464460Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6464556Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6464899Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6464937Z graph_break [] 2025-12-04T10:05:37.6465025Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6465070Z Traceback (most recent call last): 2025-12-04T10:05:37.6465219Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6465292Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6465429Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6465490Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6465647Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6465732Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6465780Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6465793Z 2025-12-04T10:05:37.6465834Z Expected 0 but got 1. 2025-12-04T10:05:37.6465873Z Absolute difference: 1 2025-12-04T10:05:37.6465914Z Relative difference: inf 2025-12-04T10:05:37.6465916Z 2025-12-04T10:05:37.6466023Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6466184Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6466186Z 2025-12-04T10:05:37.6466271Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6466344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6466385Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6466440Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6466753Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6466850Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6466888Z graph_break [] 2025-12-04T10:05:37.6466959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6467002Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6467055Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6467151Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6467459Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6467511Z graph_break [] 2025-12-04T10:05:37.6467584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6467626Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6467678Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6467776Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6468097Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6468135Z graph_break [] 2025-12-04T10:05:37.6468207Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6468250Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6468304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6468402Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6468712Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6468751Z graph_break [] 2025-12-04T10:05:37.6468823Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6468865Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6468917Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6469013Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6469334Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6469386Z graph_break [] 2025-12-04T10:05:37.6469460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6469499Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6469554Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6469648Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6469991Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6470028Z graph_break [] 2025-12-04T10:05:37.6470102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6470143Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6470198Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6470296Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6470639Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6470674Z graph_break [] 2025-12-04T10:05:37.6470748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6470788Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6470843Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6470938Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6471290Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6471328Z graph_break [] 2025-12-04T10:05:37.6471403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6471443Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6471516Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6471611Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6471952Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6471989Z graph_break [] 2025-12-04T10:05:37.6472061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6472103Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6472158Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6472253Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6472600Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6472637Z graph_break [] 2025-12-04T10:05:37.6472709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6472751Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6472815Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6472912Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6473260Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6473297Z graph_break [] 2025-12-04T10:05:37.6473370Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6473411Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6473464Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6473560Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6473898Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6473937Z graph_break [] 2025-12-04T10:05:37.6474009Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6474050Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6474103Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6474200Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6474543Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6474603Z graph_break [] 2025-12-04T10:05:37.6474676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6474717Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6474771Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6474869Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6475220Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6475256Z graph_break [] 2025-12-04T10:05:37.6475328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6475369Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6475423Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6475520Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6475867Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6475903Z graph_break [] 2025-12-04T10:05:37.6476030Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6476070Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6476124Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6476219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6476578Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6476627Z graph_break [] 2025-12-04T10:05:37.6476701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6476740Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6476794Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6476889Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6477230Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6477266Z graph_break [] 2025-12-04T10:05:37.6477338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6477380Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6477434Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6477530Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6477870Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6477908Z graph_break [] 2025-12-04T10:05:37.6477979Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6478020Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6478074Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6478170Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6478525Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6478566Z graph_break [] 2025-12-04T10:05:37.6478639Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6478680Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6478732Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6478840Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6479178Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6479219Z graph_break [] 2025-12-04T10:05:37.6479291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6479333Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6479386Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6479482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6479821Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6479859Z graph_break [] 2025-12-04T10:05:37.6479929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6479971Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6480025Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6480133Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6480479Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6480515Z graph_break [] 2025-12-04T10:05:37.6480588Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6480627Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6480682Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6480778Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6481120Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6481158Z graph_break [] 2025-12-04T10:05:37.6481230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6481270Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6481324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6481419Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6481759Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6481794Z graph_break [] 2025-12-04T10:05:37.6481897Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6481942Z Traceback (most recent call last): 2025-12-04T10:05:37.6482097Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6482166Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6482305Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6482363Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6482533Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6482604Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6482652Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6482654Z 2025-12-04T10:05:37.6482694Z Expected 0 but got 1. 
2025-12-04T10:05:37.6482736Z Absolute difference: 1 2025-12-04T10:05:37.6482777Z Relative difference: inf 2025-12-04T10:05:37.6482779Z 2025-12-04T10:05:37.6482851Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6483009Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6483012Z 2025-12-04T10:05:37.6483097Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6483170Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6483212Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6483267Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6483577Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6483686Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6483730Z graph_break [] 2025-12-04T10:05:37.6483804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6483844Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6483900Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6483995Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6484305Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6484341Z graph_break [] 2025-12-04T10:05:37.6484413Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6484454Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6484508Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6484604Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6484919Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6484955Z graph_break [] 2025-12-04T10:05:37.6485029Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6485068Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6485124Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6485219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6485531Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6485581Z graph_break [] 2025-12-04T10:05:37.6485653Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6485693Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6485753Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6485856Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6486214Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6486251Z graph_break [] 2025-12-04T10:05:37.6486323Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6486368Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6486421Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6486518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6486861Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6486902Z graph_break [] 2025-12-04T10:05:37.6486974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6487015Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6487068Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6487167Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6487525Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6487587Z graph_break [] 2025-12-04T10:05:37.6487659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6487700Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6487755Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6487853Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6488194Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6488235Z graph_break [] 2025-12-04T10:05:37.6488309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6488351Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6488404Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6488503Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6488845Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6488886Z graph_break [] 2025-12-04T10:05:37.6488960Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6489000Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6489075Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6489171Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6489513Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6489550Z graph_break [] 2025-12-04T10:05:37.6489636Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6489677Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6489733Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6489828Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6490171Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6490208Z graph_break [] 2025-12-04T10:05:37.6490282Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6490323Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6490376Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6490472Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6490812Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6490850Z graph_break [] 2025-12-04T10:05:37.6490933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6490972Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6491038Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6491134Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6491482Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6491521Z graph_break [] 2025-12-04T10:05:37.6491593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6491636Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6491690Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6491791Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6492130Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6492169Z graph_break [] 2025-12-04T10:05:37.6492241Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6492284Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6492339Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6492436Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6492778Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6492828Z graph_break [] 2025-12-04T10:05:37.6492900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6492943Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6492996Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6493094Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6493444Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6493483Z graph_break [] 2025-12-04T10:05:37.6493556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6493600Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6493653Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6493757Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6494097Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6494139Z graph_break [] 2025-12-04T10:05:37.6502770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6502819Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6502880Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6502980Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6503371Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6503423Z graph_break [] 2025-12-04T10:05:37.6503500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6503542Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6503598Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6503697Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6504039Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6504076Z graph_break [] 2025-12-04T10:05:37.6504152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6504194Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6504248Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6504342Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6504682Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6504719Z graph_break [] 2025-12-04T10:05:37.6504791Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6504833Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6504904Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6505001Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6505343Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6505378Z graph_break [] 
2025-12-04T10:05:37.6505450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6505505Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6505559Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6505653Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6506032Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6506070Z graph_break [] 2025-12-04T10:05:37.6506142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6506182Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6506235Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6506330Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6506672Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6506710Z graph_break [] 2025-12-04T10:05:37.6506783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6506837Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6506890Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6507000Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6507341Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6507376Z graph_break [] 2025-12-04T10:05:37.6507449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6507489Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6507543Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6507637Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6507977Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6508014Z graph_break [] 2025-12-04T10:05:37.6508105Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6508152Z Traceback (most recent call last): 2025-12-04T10:05:37.6508316Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6508389Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6508537Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6508598Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6508773Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6508846Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6508896Z AssertionError: Scalars are not equal! 
2025-12-04T10:05:37.6508899Z 2025-12-04T10:05:37.6508939Z Expected 0 but got 1. 2025-12-04T10:05:37.6508980Z Absolute difference: 1 2025-12-04T10:05:37.6509020Z Relative difference: inf 2025-12-04T10:05:37.6509022Z 2025-12-04T10:05:37.6509108Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6509267Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6509269Z 2025-12-04T10:05:37.6509360Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6509434Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6509477Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6509532Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6509850Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6509948Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6509984Z graph_break [] 2025-12-04T10:05:37.6510059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6510099Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6510154Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6510250Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6510575Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6510621Z 
graph_break [] 2025-12-04T10:05:37.6510695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6510735Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6510790Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6510886Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6511198Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6511234Z graph_break [] 2025-12-04T10:05:37.6511307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6511346Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6511402Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6511496Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6511805Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6511840Z graph_break [] 2025-12-04T10:05:37.6511913Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6511952Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6512005Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6512100Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6512421Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6512457Z graph_break [] 2025-12-04T10:05:37.6512533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6512572Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6512638Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6512733Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6513076Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6513114Z graph_break [] 2025-12-04T10:05:37.6513185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6513227Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6513280Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6513374Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6513714Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6513750Z graph_break [] 2025-12-04T10:05:37.6513821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6513862Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6513916Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6514023Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6514381Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6514417Z graph_break [] 2025-12-04T10:05:37.6514489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6514530Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6514583Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6514678Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6515020Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6515059Z graph_break [] 2025-12-04T10:05:37.6515130Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6515172Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6515225Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6515323Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6515661Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6515698Z graph_break [] 2025-12-04T10:05:37.6515782Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6515821Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6515876Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6516014Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6516374Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6516409Z graph_break [] 2025-12-04T10:05:37.6516483Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6516521Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6516575Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6516671Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6517011Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6517047Z graph_break [] 2025-12-04T10:05:37.6517119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6517158Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6517215Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6517309Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6517668Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6517706Z graph_break [] 2025-12-04T10:05:37.6517790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6517829Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6517884Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6517978Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6518318Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6518354Z graph_break [] 2025-12-04T10:05:37.6518425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6518466Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6518520Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6518615Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6518954Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6518990Z graph_break [] 2025-12-04T10:05:37.6519061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6519103Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6519157Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6519252Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6519608Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6519645Z graph_break [] 2025-12-04T10:05:37.6519718Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6519758Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6519811Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6519915Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6520253Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6520291Z graph_break [] 2025-12-04T10:05:37.6520362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6520405Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6520457Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6520553Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6520892Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6520929Z graph_break [] 2025-12-04T10:05:37.6521000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6521041Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6521093Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6521202Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6521556Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6521590Z graph_break [] 2025-12-04T10:05:37.6521664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6521704Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6521758Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6521851Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6522192Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6522228Z graph_break [] 2025-12-04T10:05:37.6522299Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6522339Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6522393Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6522486Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6522825Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6522859Z graph_break [] 2025-12-04T10:05:37.6522932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6522983Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6523041Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6523138Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6523478Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6523522Z graph_break [] 2025-12-04T10:05:37.6523596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6523635Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6523690Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6523784Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6524126Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6524163Z graph_break [] 2025-12-04T10:05:37.6524234Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6524274Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6524328Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6524423Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6524761Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6524809Z graph_break [] 2025-12-04T10:05:37.6524881Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6524932Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6524985Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6525079Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6525418Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6525454Z graph_break [] 2025-12-04T10:05:37.6525525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6525568Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6525622Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6525719Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6526112Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6526149Z graph_break [] 2025-12-04T10:05:37.6526238Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6526283Z Traceback (most recent call last): 2025-12-04T10:05:37.6526436Z File 
"/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6526508Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6526664Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6526724Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6526884Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6526955Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6527003Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6527006Z 2025-12-04T10:05:37.6527044Z Expected 0 but got 1. 2025-12-04T10:05:37.6527084Z Absolute difference: 1 2025-12-04T10:05:37.6527135Z Relative difference: inf 2025-12-04T10:05:37.6527137Z 2025-12-04T10:05:37.6527210Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6527367Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6527369Z 2025-12-04T10:05:37.6527458Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6527530Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6527573Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6527626Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6527936Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6528033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T10:05:37.6528070Z graph_break [] 2025-12-04T10:05:37.6528140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6528180Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6528233Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6528343Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6528664Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6528702Z graph_break [] 2025-12-04T10:05:37.6528773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6528816Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6528869Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6528963Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6529272Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6529309Z graph_break [] 2025-12-04T10:05:37.6529383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6529421Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6529475Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6529568Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6529880Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6529916Z graph_break [] 2025-12-04T10:05:37.6529991Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6530030Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6530102Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6530198Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6530512Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6530547Z graph_break [] 2025-12-04T10:05:37.6530629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6530669Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6530723Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6530816Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6531157Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6531193Z graph_break [] 2025-12-04T10:05:37.6531265Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6531303Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6531356Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6531450Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6531791Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6531826Z graph_break [] 2025-12-04T10:05:37.6531913Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6531954Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6532021Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6532116Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6532458Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6532495Z graph_break [] 2025-12-04T10:05:37.6532567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6532608Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6532662Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6532758Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6533099Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6533135Z graph_break [] 2025-12-04T10:05:37.6533205Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:05:37.6533246Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6533302Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6533397Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6533735Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6533782Z graph_break [] 2025-12-04T10:05:37.6533854Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6533894Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6533947Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6534044Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6534395Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6534432Z graph_break [] 2025-12-04T10:05:37.6534505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6534548Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6534602Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6534699Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6535038Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), 
('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6535073Z graph_break [] 2025-12-04T10:05:37.6535146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6535186Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6535241Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6535335Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6535684Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6535729Z graph_break [] 2025-12-04T10:05:37.6535804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6535844Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6535899Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6536031Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6536374Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6536409Z graph_break [] 2025-12-04T10:05:37.6536489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6536530Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6536585Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6536679Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 
2025-12-04T10:05:37.6537017Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6537052Z graph_break [] 2025-12-04T10:05:37.6537123Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6537162Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6537219Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6537329Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6537670Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6537707Z graph_break [] 2025-12-04T10:05:37.6537779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6537830Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6537883Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6537978Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6538320Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6538358Z graph_break [] 2025-12-04T10:05:37.6538432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6538473Z frames [('total', 
1), ('ok', 1)] 2025-12-04T10:05:37.6538525Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6538621Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6538961Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6538997Z graph_break [] 2025-12-04T10:05:37.6539067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6539122Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6539175Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6539284Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6539622Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6539660Z graph_break [] 2025-12-04T10:05:37.6539731Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6539771Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6539824Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6539918Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6540257Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6540295Z graph_break [] 2025-12-04T10:05:37.6540368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6540408Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6540462Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6540557Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6540897Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6540945Z graph_break [] 2025-12-04T10:05:37.6541018Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6541059Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6541113Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6541207Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6541553Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6541589Z graph_break [] 2025-12-04T10:05:37.6541660Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6541700Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6541755Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6541851Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6542193Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6542228Z graph_break [] 2025-12-04T10:05:37.6542301Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6542340Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6542394Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6542489Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6542842Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6542899Z graph_break [] 2025-12-04T10:05:37.6542971Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6543013Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6543067Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6543163Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6543501Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6543537Z graph_break [] 2025-12-04T10:05:37.6543612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6543652Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6543707Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6543802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6544140Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6544176Z graph_break [] 2025-12-04T10:05:37.6544247Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6544287Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6544340Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6544445Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6544789Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6544826Z graph_break [] 2025-12-04T10:05:37.6544913Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6544961Z Traceback (most recent call last): 2025-12-04T10:05:37.6545126Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6545197Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6545336Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6545397Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6545557Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6545629Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6545677Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6545680Z 2025-12-04T10:05:37.6545719Z Expected 0 but got 1. 2025-12-04T10:05:37.6545757Z Absolute difference: 1 2025-12-04T10:05:37.6545797Z Relative difference: inf 2025-12-04T10:05:37.6545799Z 2025-12-04T10:05:37.6545873Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6546067Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6546069Z 2025-12-04T10:05:37.6546157Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6546230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6546285Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6546339Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6546664Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6546761Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6546799Z graph_break [] 2025-12-04T10:05:37.6546870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6546910Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6546965Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6547059Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6547371Z inductor 
[('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6547408Z graph_break [] 2025-12-04T10:05:37.6547479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6547520Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6547573Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6547671Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6547981Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6548032Z graph_break [] 2025-12-04T10:05:37.6548103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6548145Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6548197Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6548292Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6548616Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6548653Z graph_break [] 2025-12-04T10:05:37.6548724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6548763Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6548817Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6548913Z 
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6549222Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6549257Z graph_break [] 2025-12-04T10:05:37.6549329Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6549369Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6549424Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6549518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6549871Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6549908Z graph_break [] 2025-12-04T10:05:37.6549990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6550029Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6550083Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6550178Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6550521Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6550557Z graph_break [] 2025-12-04T10:05:37.6550630Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6550670Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6550724Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6550818Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6551159Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6551194Z graph_break [] 2025-12-04T10:05:37.6551267Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6551307Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6551360Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6551455Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6551796Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6551844Z graph_break [] 2025-12-04T10:05:37.6551915Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6551955Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6552007Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6552113Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6552449Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6552486Z graph_break [] 2025-12-04T10:05:37.6552557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6552598Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6552651Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6552746Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6553085Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6553122Z graph_break [] 2025-12-04T10:05:37.6553194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6553235Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6553288Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6553393Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6553735Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6553784Z graph_break [] 2025-12-04T10:05:37.6553855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6553896Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6553950Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6554045Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6554386Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6554423Z graph_break [] 2025-12-04T10:05:37.6554496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6554535Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6554588Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6554682Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6555022Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6555057Z graph_break [] 2025-12-04T10:05:37.6555128Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6555177Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6555231Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6555327Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6555666Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6555711Z graph_break [] 2025-12-04T10:05:37.6555785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6555825Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6555879Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6556012Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6556353Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6556389Z graph_break [] 2025-12-04T10:05:37.6556460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6556499Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6556553Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6556647Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6556986Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6557022Z graph_break [] 2025-12-04T10:05:37.6557113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6557166Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6557219Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6557315Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6557652Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6557687Z graph_break [] 2025-12-04T10:05:37.6557758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6557798Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6557852Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6557949Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6558290Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6558327Z graph_break [] 2025-12-04T10:05:37.6558398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6558440Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6558492Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6558586Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6558924Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6558977Z graph_break [] 2025-12-04T10:05:37.6559048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6559088Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6559141Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6559238Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6559592Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6559627Z graph_break [] 2025-12-04T10:05:37.6559700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6559740Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6559794Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6559888Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6560233Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6560268Z graph_break [] 2025-12-04T10:05:37.6560341Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6560380Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6560433Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6560527Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6560877Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6560923Z graph_break [] 2025-12-04T10:05:37.6560994Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6561034Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6561088Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6561182Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6561522Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6561558Z graph_break [] 2025-12-04T10:05:37.6561631Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6561670Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6561723Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6561816Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6562157Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6562193Z graph_break [] 2025-12-04T10:05:37.6562264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6562316Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6562370Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6562465Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6562805Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6562842Z graph_break [] 2025-12-04T10:05:37.6562921Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6562962Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6563014Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6563108Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6563450Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6563487Z graph_break [] 2025-12-04T10:05:37.6563557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6563597Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6563649Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6563744Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6564083Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6564120Z graph_break [] 2025-12-04T10:05:37.6564219Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6564275Z Traceback (most recent call last): 2025-12-04T10:05:37.6564427Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6564498Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6564635Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6564697Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6564856Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6564928Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6564976Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6564979Z 2025-12-04T10:05:37.6565017Z Expected 0 but got 1. 2025-12-04T10:05:37.6565057Z Absolute difference: 1 2025-12-04T10:05:37.6565096Z Relative difference: inf 2025-12-04T10:05:37.6565099Z 2025-12-04T10:05:37.6565171Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6565325Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6565327Z 2025-12-04T10:05:37.6565414Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6565486Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6565527Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6565580Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6565894Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6566067Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6566103Z graph_break [] 2025-12-04T10:05:37.6566174Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6566216Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6566268Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6566362Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6566688Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6566727Z graph_break [] 2025-12-04T10:05:37.6566800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6566843Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6566896Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6566996Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6567305Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6567341Z graph_break [] 2025-12-04T10:05:37.6567413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6567453Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6567506Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6567601Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6567928Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6567976Z graph_break [] 2025-12-04T10:05:37.6568049Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6568088Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6568140Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6568236Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6568547Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6568584Z graph_break [] 2025-12-04T10:05:37.6568658Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6568697Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6568751Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6568844Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6569187Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6569222Z graph_break [] 2025-12-04T10:05:37.6569294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6569333Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6569386Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6569494Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6569833Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6569871Z graph_break [] 2025-12-04T10:05:37.6569943Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6569997Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6570052Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6570146Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6570487Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6570524Z graph_break [] 2025-12-04T10:05:37.6570596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6570636Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6570688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6570784Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6571130Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6571167Z graph_break [] 2025-12-04T10:05:37.6571238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6571280Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6571345Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6571451Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6571788Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6571825Z graph_break [] 2025-12-04T10:05:37.6571896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6571938Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6571990Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6572084Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6572424Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6572462Z graph_break [] 2025-12-04T10:05:37.6572533Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6572573Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6572625Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6572721Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6573062Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6573109Z graph_break [] 2025-12-04T10:05:37.6573183Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6573223Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6573275Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6573371Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6573722Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6573758Z graph_break [] 2025-12-04T10:05:37.6573830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6573869Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6573924Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6574019Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6574361Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6574395Z graph_break [] 2025-12-04T10:05:37.6574468Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6574507Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6574561Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6574654Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6575003Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6575049Z graph_break [] 2025-12-04T10:05:37.6575122Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6575161Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6575216Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6575313Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6575656Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6575691Z graph_break [] 2025-12-04T10:05:37.6575764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6575805Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6575860Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6575997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6576336Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6576372Z graph_break [] 2025-12-04T10:05:37.6576443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6576482Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6576535Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6576633Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6576984Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6577022Z graph_break [] 2025-12-04T10:05:37.6577093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6577134Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6577199Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6577295Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6577636Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6577673Z graph_break [] 2025-12-04T10:05:37.6577744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6577788Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6577840Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6577935Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6578275Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6578311Z graph_break [] 2025-12-04T10:05:37.6578381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6578422Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6578488Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6578584Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6578934Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6578970Z graph_break [] 2025-12-04T10:05:37.6579042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6579082Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6579135Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6579229Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6579573Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6579609Z graph_break [] 2025-12-04T10:05:37.6579680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6579719Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6579772Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6579867Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6580208Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6580254Z graph_break [] 2025-12-04T10:05:37.6580330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6580369Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6580425Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6580518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6580867Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6580903Z graph_break [] 2025-12-04T10:05:37.6580974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6581013Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6581066Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6581162Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6581504Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6581541Z graph_break [] 2025-12-04T10:05:37.6581611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6581653Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6581705Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6581801Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6582148Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6582195Z graph_break [] 2025-12-04T10:05:37.6582266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6582306Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6582359Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6582454Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6582794Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6582830Z graph_break [] 2025-12-04T10:05:37.6582901Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6582942Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6582996Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6583093Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6583432Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6583469Z graph_break [] 2025-12-04T10:05:37.6583540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6583581Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6583634Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6583730Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6584085Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6584120Z graph_break [] 2025-12-04T10:05:37.6584209Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6584254Z Traceback (most recent call last): 2025-12-04T10:05:37.6584415Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6584485Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6584625Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6584685Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6584846Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6584916Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6584966Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6584969Z 2025-12-04T10:05:37.6585007Z Expected 0 but got 1. 
2025-12-04T10:05:37.6585046Z Absolute difference: 1 2025-12-04T10:05:37.6585085Z Relative difference: inf 2025-12-04T10:05:37.6585087Z 2025-12-04T10:05:37.6585158Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6585315Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6585317Z 2025-12-04T10:05:37.6585403Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6585475Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6585517Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6585580Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6585892Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6586042Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6586078Z graph_break [] 2025-12-04T10:05:37.6586150Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6586190Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6586243Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6586339Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6586649Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6586687Z graph_break [] 2025-12-04T10:05:37.6586760Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6586799Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6586852Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6586947Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6587259Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6587294Z graph_break [] 2025-12-04T10:05:37.6587365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6587420Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6587474Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6587569Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6587876Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6587923Z graph_break [] 2025-12-04T10:05:37.6587995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6588034Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6588088Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6588182Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6588495Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6588531Z graph_break [] 2025-12-04T10:05:37.6588606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6588647Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6588702Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6588797Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6589139Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6589175Z graph_break [] 2025-12-04T10:05:37.6589259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6589311Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6589367Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6589460Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6589800Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6589837Z graph_break [] 2025-12-04T10:05:37.6589907Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6589949Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6590004Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6590101Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6590444Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6590480Z graph_break [] 2025-12-04T10:05:37.6590552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6590593Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6590646Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6590740Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6591080Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6591131Z graph_break [] 2025-12-04T10:05:37.6591202Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6591242Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6591296Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6591390Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6591740Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6591778Z graph_break [] 2025-12-04T10:05:37.6591848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6591891Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6591945Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6592041Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6592378Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6592414Z graph_break [] 2025-12-04T10:05:37.6592487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6592526Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6592578Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6592672Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6593022Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6593068Z graph_break [] 2025-12-04T10:05:37.6593140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6593179Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6593233Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6593327Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6593664Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6593703Z graph_break [] 2025-12-04T10:05:37.6593778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6593820Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6593873Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6593967Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6594308Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6594343Z graph_break [] 2025-12-04T10:05:37.6594416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6594454Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6594521Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6594617Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6594959Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6594995Z graph_break [] 2025-12-04T10:05:37.6595076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6595118Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6595172Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6595266Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6595604Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6595641Z graph_break [] 2025-12-04T10:05:37.6595712Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6595752Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6595805Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6595900Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6596275Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6596313Z graph_break [] 2025-12-04T10:05:37.6596400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6596442Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6596518Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6596614Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6596956Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6596991Z graph_break [] 2025-12-04T10:05:37.6597062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6597101Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6597155Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6597254Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6597590Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6597627Z graph_break [] 2025-12-04T10:05:37.6597699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6597739Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6597792Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6597887Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6598228Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6598277Z graph_break [] 2025-12-04T10:05:37.6598352Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6598391Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6598445Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6598539Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6598890Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6598925Z graph_break [] 
2025-12-04T10:05:37.6598998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6599038Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6599092Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6599186Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6599525Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6599560Z graph_break [] 2025-12-04T10:05:37.6599634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6599673Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6599727Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6599820Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6600171Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6600218Z graph_break [] 2025-12-04T10:05:37.6600290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6600329Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6600383Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6600478Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6600819Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6600857Z graph_break [] 2025-12-04T10:05:37.6600929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6600971Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6601024Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6601119Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6601458Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6601494Z graph_break [] 2025-12-04T10:05:37.6601565Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6601605Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6601669Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6601765Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6602106Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6602142Z graph_break [] 2025-12-04T10:05:37.6602212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6602263Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6602317Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6602411Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6602750Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6602788Z graph_break [] 2025-12-04T10:05:37.6602861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6602901Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6602954Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6603048Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6603389Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6603424Z graph_break [] 2025-12-04T10:05:37.6603496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6603544Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6603600Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6603704Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6604043Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6604079Z graph_break [] 2025-12-04T10:05:37.6604150Z 
----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6604190Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6604243Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6604336Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6604678Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6604714Z graph_break []
2025-12-04T10:05:37.6604802Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.6604847Z Traceback (most recent call last):
2025-12-04T10:05:37.6605001Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.6605071Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.6605211Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.6605270Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.6605440Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.6605511Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.6605559Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.6605562Z
2025-12-04T10:05:37.6605600Z Expected 0 but got 1.
2025-12-04T10:05:37.6605638Z Absolute difference: 1
2025-12-04T10:05:37.6605678Z Relative difference: inf
2025-12-04T10:05:37.6605680Z
2025-12-04T10:05:37.6605763Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.6605919Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.6605959Z
2025-12-04T10:05:37.6606045Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.6606117Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6606161Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6606219Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6606530Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6606627Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6606662Z graph_break []
2025-12-04T10:05:37.6606734Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6606774Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6606828Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6606923Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6607250Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6607298Z graph_break []
2025-12-04T10:05:37.6607370Z ----------------------------- Captured
stdout call ----------------------------- 2025-12-04T10:05:37.6607409Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6607462Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6607557Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6607868Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6607903Z graph_break [] 2025-12-04T10:05:37.6607978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6608019Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6608074Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6608167Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6608476Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6608511Z graph_break [] 2025-12-04T10:05:37.6608583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6608622Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6608675Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6608769Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6609091Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6609133Z graph_break [] 2025-12-04T10:05:37.6609205Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6609247Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6609318Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6609421Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6609832Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6609891Z graph_break [] 2025-12-04T10:05:37.6609993Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6610047Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6610127Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6610235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6610623Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6610674Z graph_break [] 2025-12-04T10:05:37.6610776Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6610827Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6610910Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6611040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6611421Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6611483Z graph_break [] 2025-12-04T10:05:37.6611574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6611637Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6611716Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6611845Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6612201Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6612278Z graph_break [] 2025-12-04T10:05:37.6612364Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6612436Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6612510Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6612638Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6612996Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6613058Z graph_break [] 2025-12-04T10:05:37.6613146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6613231Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6613299Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6613427Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6613790Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6613846Z graph_break [] 2025-12-04T10:05:37.6613965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6614022Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6614105Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6614214Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6614573Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6614635Z graph_break [] 2025-12-04T10:05:37.6614747Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6614800Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6614888Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6614997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6615385Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6615446Z graph_break [] 2025-12-04T10:05:37.6615558Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6615610Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6615691Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6615792Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6616236Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6616287Z graph_break [] 2025-12-04T10:05:37.6616387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6616442Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6616522Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6616665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6617018Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6617087Z graph_break [] 2025-12-04T10:05:37.6617173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6617242Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6617323Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6617456Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6617828Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6617902Z graph_break [] 2025-12-04T10:05:37.6617987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6618060Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6618133Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6618273Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6618630Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6618693Z graph_break [] 2025-12-04T10:05:37.6618772Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6618863Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6618929Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6619060Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6619426Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6619468Z graph_break [] 2025-12-04T10:05:37.6619584Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6619637Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6619722Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6619848Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6620225Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6620291Z graph_break [] 2025-12-04T10:05:37.6620400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6620454Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6620534Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6620644Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6621018Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6621079Z graph_break [] 2025-12-04T10:05:37.6621180Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6621234Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6621316Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6621417Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6621804Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6621866Z graph_break [] 
2025-12-04T10:05:37.6621952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6622034Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6622096Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6622250Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6622601Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6622677Z graph_break [] 2025-12-04T10:05:37.6622763Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6622832Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6622910Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6623040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6623395Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6623458Z graph_break [] 2025-12-04T10:05:37.6623551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6623623Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6623699Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6623823Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6624174Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6624255Z graph_break [] 2025-12-04T10:05:37.6624362Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6624433Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6624514Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6624627Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6624995Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6625037Z graph_break [] 2025-12-04T10:05:37.6625152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6625205Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6625291Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6625400Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6625760Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6625821Z graph_break [] 2025-12-04T10:05:37.6625995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6626053Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6626135Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6626244Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6626613Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6626716Z graph_break [] 2025-12-04T10:05:37.6626802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6626871Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6626938Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6627078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6627443Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6627509Z graph_break [] 2025-12-04T10:05:37.6627594Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6627662Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6627724Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6627871Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6628229Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6628292Z graph_break [] 2025-12-04T10:05:37.6628378Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6628442Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6628519Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6628671Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6629023Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6629099Z graph_break [] 2025-12-04T10:05:37.6629187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6629266Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6629357Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6629466Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6629833Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6629886Z graph_break [] 2025-12-04T10:05:37.6630011Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6630079Z Traceback (most recent call last): 2025-12-04T10:05:37.6630259Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6630343Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6630514Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 
2025-12-04T10:05:37.6630580Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6630792Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6630876Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6630964Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6630966Z 2025-12-04T10:05:37.6631022Z Expected 0 but got 1. 2025-12-04T10:05:37.6631102Z Absolute difference: 1 2025-12-04T10:05:37.6631166Z Relative difference: inf 2025-12-04T10:05:37.6631168Z 2025-12-04T10:05:37.6631277Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6631451Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6631454Z 2025-12-04T10:05:37.6631585Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6631672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6631744Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6631819Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6632161Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6632286Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6632334Z graph_break [] 2025-12-04T10:05:37.6632441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6632501Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6632587Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6632697Z aot_autograd [('total', 
1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6633032Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6633077Z graph_break [] 2025-12-04T10:05:37.6633206Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6633278Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6633360Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6633468Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6633800Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6633862Z graph_break [] 2025-12-04T10:05:37.6633977Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6634029Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6634112Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6634223Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6634566Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6634641Z graph_break [] 2025-12-04T10:05:37.6634727Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6634794Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6634862Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6634993Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6635325Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6635402Z graph_break [] 2025-12-04T10:05:37.6635488Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6635557Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6635622Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6635762Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6636171Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6636234Z graph_break [] 2025-12-04T10:05:37.6636320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6636387Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6636466Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6636595Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6636950Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T10:05:37.6637019Z graph_break [] 2025-12-04T10:05:37.6637104Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6637176Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6637263Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6637372Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6637759Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6637822Z graph_break [] 2025-12-04T10:05:37.6637933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6637997Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6638085Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6638193Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6638563Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6638609Z graph_break [] 2025-12-04T10:05:37.6638733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6638787Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6638874Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6638984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6639344Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6639404Z graph_break [] 2025-12-04T10:05:37.6639519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6639572Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6639656Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6639779Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6640153Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6640230Z graph_break [] 2025-12-04T10:05:37.6640316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6640394Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6640462Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6640594Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6640956Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6641021Z graph_break [] 2025-12-04T10:05:37.6641107Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6641175Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6641235Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6641382Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6641734Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6641797Z graph_break [] 2025-12-04T10:05:37.6641882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6641960Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6642042Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6642188Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6642554Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6642606Z graph_break [] 2025-12-04T10:05:37.6642704Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6642769Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6642858Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6642966Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6643339Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6643389Z graph_break [] 
2025-12-04T10:05:37.6643496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6643555Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6643637Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6643747Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6644117Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6644172Z graph_break [] 2025-12-04T10:05:37.6644293Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6644346Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6644432Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6644539Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6644910Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6644990Z graph_break [] 2025-12-04T10:05:37.6645074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6645149Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6645216Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6645333Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6645698Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6645772Z graph_break [] 2025-12-04T10:05:37.6645859Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6645972Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6646039Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6646164Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6646556Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6646634Z graph_break [] 2025-12-04T10:05:37.6646722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6646790Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6646852Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6646997Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6647348Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6647411Z graph_break [] 2025-12-04T10:05:37.6647516Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6647563Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6647666Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6647774Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6648141Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6648192Z graph_break [] 2025-12-04T10:05:37.6648285Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6648358Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6648445Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6648568Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6648936Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6648989Z graph_break [] 2025-12-04T10:05:37.6649097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6649156Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6649253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6649360Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6649736Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6649779Z graph_break [] 2025-12-04T10:05:37.6649901Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6649957Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6650038Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6650155Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6650516Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6650601Z graph_break [] 2025-12-04T10:05:37.6650685Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6650759Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6650836Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6650954Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6651330Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6651400Z graph_break [] 2025-12-04T10:05:37.6651493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6651562Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6651628Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6651758Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6652120Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6652189Z graph_break [] 2025-12-04T10:05:37.6652277Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6652345Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6652408Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6652551Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6652919Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6652979Z graph_break [] 2025-12-04T10:05:37.6653080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6655160Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6655221Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6655316Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6655674Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6655714Z graph_break [] 2025-12-04T10:05:37.6655788Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6655831Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6655885Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6656091Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6656432Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6656468Z graph_break [] 2025-12-04T10:05:37.6656540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6656582Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6656636Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6656732Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6657091Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6657144Z graph_break [] 2025-12-04T10:05:37.6657215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6657257Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6657310Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6657406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6657746Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6657781Z graph_break [] 2025-12-04T10:05:37.6657853Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6657895Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6657950Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6658046Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6658387Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6658424Z graph_break [] 2025-12-04T10:05:37.6658514Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6658561Z Traceback (most recent call last): 2025-12-04T10:05:37.6658720Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6658812Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6658958Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6659020Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6659180Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6659251Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6659301Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6659303Z 2025-12-04T10:05:37.6659357Z Expected 0 but got 1. 
2025-12-04T10:05:37.6659397Z Absolute difference: 1 2025-12-04T10:05:37.6659438Z Relative difference: inf 2025-12-04T10:05:37.6659440Z 2025-12-04T10:05:37.6659514Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6659672Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6659675Z 2025-12-04T10:05:37.6659766Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6659840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6659882Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6659937Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6660252Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6660350Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6660386Z graph_break [] 2025-12-04T10:05:37.6660458Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6660498Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6660566Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6660661Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6660987Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6661022Z graph_break [] 2025-12-04T10:05:37.6661097Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6661137Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6661191Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6661285Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6661596Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6661633Z graph_break [] 2025-12-04T10:05:37.6661705Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6661744Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6661798Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6661892Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6662206Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6662242Z graph_break [] 2025-12-04T10:05:37.6662315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6662366Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6662421Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6662517Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6662830Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6662876Z graph_break [] 2025-12-04T10:05:37.6662951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6662990Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6663045Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6663141Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6663487Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6663524Z graph_break [] 2025-12-04T10:05:37.6663596Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6663636Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6663688Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6663785Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6664125Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6664163Z graph_break [] 2025-12-04T10:05:37.6664244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6664296Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6664349Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6664444Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6664783Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6664820Z graph_break [] 2025-12-04T10:05:37.6664891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6664932Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6664985Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6665083Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6665423Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6665460Z graph_break [] 2025-12-04T10:05:37.6665532Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6665572Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6665625Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6665722Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6666104Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6666158Z graph_break [] 2025-12-04T10:05:37.6666230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6666269Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6666325Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6666419Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6666773Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6666809Z graph_break [] 2025-12-04T10:05:37.6666882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6666922Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6666976Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6667070Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6667414Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6667449Z graph_break [] 2025-12-04T10:05:37.6667522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6667561Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6667616Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6667709Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6668063Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6668111Z graph_break [] 2025-12-04T10:05:37.6668184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6668223Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6668278Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6668372Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6668711Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6668751Z graph_break [] 2025-12-04T10:05:37.6668823Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6668863Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6668917Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6669012Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6669354Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6669390Z graph_break [] 2025-12-04T10:05:37.6669461Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6669512Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6669566Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6669662Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6670002Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6670039Z graph_break [] 2025-12-04T10:05:37.6670120Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6670161Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6670214Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6670310Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6670650Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6670687Z graph_break [] 2025-12-04T10:05:37.6670758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6670799Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6670852Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6670948Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6671291Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6671329Z graph_break [] 2025-12-04T10:05:37.6671413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6671469Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6671525Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6671648Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6671996Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6672033Z graph_break [] 2025-12-04T10:05:37.6672105Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6672144Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6672203Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6672298Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6672642Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6672677Z graph_break [] 2025-12-04T10:05:37.6672750Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6672790Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6672844Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6672937Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6673277Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6673329Z graph_break [] 
2025-12-04T10:05:37.6673404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6673443Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6673497Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6673591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6673942Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6673979Z graph_break [] 2025-12-04T10:05:37.6674050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6674092Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6674145Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6674242Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6674580Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6674617Z graph_break [] 2025-12-04T10:05:37.6674687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6674727Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6674780Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6674875Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6675226Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6675273Z graph_break [] 2025-12-04T10:05:37.6675343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6675384Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6675437Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6675533Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6675873Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6675911Z graph_break [] 2025-12-04T10:05:37.6676038Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6676080Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6676133Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6676228Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6676568Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6676605Z graph_break [] 2025-12-04T10:05:37.6676676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6676717Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6676795Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6676891Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6677232Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6677268Z graph_break [] 2025-12-04T10:05:37.6677361Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6677402Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6677455Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6677548Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6677892Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6677929Z graph_break [] 2025-12-04T10:05:37.6678001Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6678041Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6678095Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6678192Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6678530Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6678564Z graph_break [] 2025-12-04T10:05:37.6678655Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6678695Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6678764Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6678857Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6679198Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6679233Z graph_break [] 2025-12-04T10:05:37.6679305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6679344Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6679397Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6679491Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6679833Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6679870Z graph_break [] 2025-12-04T10:05:37.6679941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6679981Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6680035Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6680131Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6680469Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6680517Z graph_break [] 2025-12-04T10:05:37.6680589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6680629Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6680681Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6680776Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6681129Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6681166Z graph_break [] 2025-12-04T10:05:37.6681253Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6681300Z Traceback (most recent call last): 2025-12-04T10:05:37.6681454Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6681527Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6681668Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6681729Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6681887Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6681958Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6682006Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6682008Z 2025-12-04T10:05:37.6682048Z Expected 0 but got 1. 
2025-12-04T10:05:37.6682086Z Absolute difference: 1 2025-12-04T10:05:37.6682128Z Relative difference: inf 2025-12-04T10:05:37.6682131Z 2025-12-04T10:05:37.6682216Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6682375Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6682388Z 2025-12-04T10:05:37.6682477Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6682550Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6682591Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6682645Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6682960Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6683056Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6683093Z graph_break [] 2025-12-04T10:05:37.6683165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6683206Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6683260Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6683356Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6683666Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6683703Z graph_break [] 2025-12-04T10:05:37.6683773Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6683813Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6683866Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6683974Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6684284Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6684321Z graph_break [] 2025-12-04T10:05:37.6684391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6684441Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6684494Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6684590Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6684897Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6684934Z graph_break [] 2025-12-04T10:05:37.6685005Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6685045Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6685097Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6685193Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6685502Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6685537Z graph_break [] 2025-12-04T10:05:37.6685610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6685651Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6685721Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6685816Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6686222Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6686257Z graph_break [] 2025-12-04T10:05:37.6686330Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6686370Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6686423Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6686518Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6686865Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6686902Z graph_break [] 2025-12-04T10:05:37.6686973Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6687012Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6687066Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6687160Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6687500Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6687556Z graph_break [] 2025-12-04T10:05:37.6687629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6687668Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6687723Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6687817Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6688169Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6688206Z graph_break [] 2025-12-04T10:05:37.6688276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6688317Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6688370Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6688467Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6688809Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6688846Z graph_break [] 2025-12-04T10:05:37.6688917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6688958Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6689010Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6689105Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6689457Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6689509Z graph_break [] 2025-12-04T10:05:37.6689579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6689620Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6689673Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6689769Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6690108Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6690144Z graph_break [] 2025-12-04T10:05:37.6690215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6690256Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6690309Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6690406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6690748Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6690784Z graph_break [] 2025-12-04T10:05:37.6690862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6690901Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6690954Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6691049Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6691398Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6691434Z graph_break [] 2025-12-04T10:05:37.6691507Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6691546Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6691599Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6691705Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6692044Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6692080Z graph_break [] 2025-12-04T10:05:37.6692153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6692193Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6692247Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6692341Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6692684Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6692719Z graph_break [] 2025-12-04T10:05:37.6692792Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6692831Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6692886Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6692993Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6693343Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6693379Z graph_break [] 2025-12-04T10:05:37.6693450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6693490Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6693543Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6693639Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6693979Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6694018Z graph_break [] 2025-12-04T10:05:37.6694089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6694130Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6694182Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6694277Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6694616Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6694651Z graph_break [] 2025-12-04T10:05:37.6694733Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6694774Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6694829Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6694925Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6695278Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6695315Z graph_break [] 2025-12-04T10:05:37.6695385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6695426Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6695478Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6695573Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6695916Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6696003Z graph_break [] 
2025-12-04T10:05:37.6696076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6696115Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6696169Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6696263Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6696620Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6696656Z graph_break [] 2025-12-04T10:05:37.6696743Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6696782Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6696835Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6696928Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6697276Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6697311Z graph_break [] 2025-12-04T10:05:37.6697384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6697424Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6697479Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6697573Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6697913Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6697948Z graph_break [] 2025-12-04T10:05:37.6698020Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6698059Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6698112Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6698206Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6698545Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6698598Z graph_break [] 2025-12-04T10:05:37.6698669Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6698709Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6698761Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6698875Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6699213Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6699252Z graph_break [] 2025-12-04T10:05:37.6699323Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6699365Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6699419Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6699513Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6699849Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6699885Z graph_break [] 2025-12-04T10:05:37.6699955Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6699995Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6700047Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6700154Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6700492Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6700539Z graph_break [] 2025-12-04T10:05:37.6700610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6700651Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6700703Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6700801Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6701141Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6701178Z graph_break [] 2025-12-04T10:05:37.6701251Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6701291Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6701343Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6701439Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6701778Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6701814Z graph_break [] 2025-12-04T10:05:37.6701885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6701935Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6701992Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6702086Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6702425Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6702472Z graph_break [] 2025-12-04T10:05:37.6702544Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6702583Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6702636Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6702731Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6703074Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6703110Z graph_break [] 2025-12-04T10:05:37.6703181Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6703221Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6703275Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6703370Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6703709Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6703745Z graph_break [] 2025-12-04T10:05:37.6703829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6703879Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6703933Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6704028Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6704367Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6704403Z graph_break [] 2025-12-04T10:05:37.6704490Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6704535Z Traceback (most recent call last): 2025-12-04T10:05:37.6704687Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in 
test_3layer_split_reduction 2025-12-04T10:05:37.6704759Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6704898Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6704958Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6705116Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6705188Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6705236Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6705239Z 2025-12-04T10:05:37.6705277Z Expected 0 but got 1. 2025-12-04T10:05:37.6705315Z Absolute difference: 1 2025-12-04T10:05:37.6705356Z Relative difference: inf 2025-12-04T10:05:37.6705358Z 2025-12-04T10:05:37.6705429Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6705600Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6705603Z 2025-12-04T10:05:37.6705689Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6705764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6705804Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6705861Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6706240Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6706340Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6706376Z graph_break [] 2025-12-04T10:05:37.6706450Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6706489Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6706544Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6706638Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6706950Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6706987Z graph_break [] 2025-12-04T10:05:37.6707058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6707097Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6707152Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6707250Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6707571Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6707620Z graph_break [] 2025-12-04T10:05:37.6707690Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6707733Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6707787Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6707883Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6708193Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), 
('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6708231Z graph_break [] 2025-12-04T10:05:37.6708305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6708345Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6708397Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6708494Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6708803Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6708839Z graph_break [] 2025-12-04T10:05:37.6708911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6708951Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6709003Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6709116Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6709456Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6709492Z graph_break [] 2025-12-04T10:05:37.6709564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6709625Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6709678Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6709773Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6710114Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6710150Z graph_break [] 2025-12-04T10:05:37.6710224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6710263Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6710317Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6710412Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6710756Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6710791Z graph_break [] 2025-12-04T10:05:37.6710865Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6710915Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6710971Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6711078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6711418Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6711453Z graph_break [] 2025-12-04T10:05:37.6711526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6711565Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6711619Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6711712Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6712055Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6712090Z graph_break [] 2025-12-04T10:05:37.6712163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6712204Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6712260Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6712354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6712695Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6712749Z graph_break [] 2025-12-04T10:05:37.6712819Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6712861Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6712914Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6713009Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6713357Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6713394Z graph_break [] 2025-12-04T10:05:37.6713465Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6713505Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6713559Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6713653Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6713993Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6714029Z graph_break [] 2025-12-04T10:05:37.6714100Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6714141Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6714194Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6714290Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6714640Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6714687Z graph_break [] 2025-12-04T10:05:37.6714758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6714800Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6714853Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6714949Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6715288Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6715324Z graph_break [] 2025-12-04T10:05:37.6715399Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6715438Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6715494Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6715587Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6715970Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6716005Z graph_break [] 2025-12-04T10:05:37.6716077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6716115Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6716169Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6716280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6716626Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6716662Z graph_break [] 2025-12-04T10:05:37.6716734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6716792Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6716847Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6716941Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6717282Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6717319Z graph_break [] 2025-12-04T10:05:37.6717391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6717430Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6717483Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6717577Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6717918Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6717954Z graph_break [] 2025-12-04T10:05:37.6718024Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6718066Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6718134Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6718244Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6718586Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6718623Z graph_break [] 
2025-12-04T10:05:37.6718694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6718735Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6718788Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6718883Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6719228Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6719265Z graph_break [] 2025-12-04T10:05:37.6719336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6719376Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6719429Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6719526Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6719864Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6719911Z graph_break [] 2025-12-04T10:05:37.6719983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6720024Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6720077Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6720172Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6720524Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6720560Z graph_break [] 2025-12-04T10:05:37.6720633Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6720672Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6720727Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6720822Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6721164Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6721198Z graph_break [] 2025-12-04T10:05:37.6721270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6721310Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6721363Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6721457Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6721808Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6721854Z graph_break [] 2025-12-04T10:05:37.6721926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6721965Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6722021Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6722115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6722452Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6722487Z graph_break [] 2025-12-04T10:05:37.6722560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6722601Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6722654Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6722750Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6723088Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6723123Z graph_break [] 2025-12-04T10:05:37.6723196Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6723235Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6723288Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6723383Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6723735Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6723773Z graph_break [] 2025-12-04T10:05:37.6723844Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6723883Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6723947Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6724042Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6724382Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6724419Z graph_break [] 2025-12-04T10:05:37.6724489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6724531Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6724583Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6724678Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6725019Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6725055Z graph_break [] 2025-12-04T10:05:37.6725126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6725167Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6725255Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6725350Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6725709Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6725746Z graph_break [] 2025-12-04T10:05:37.6725820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6725861Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6725914Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6726046Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6726386Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6726422Z graph_break [] 2025-12-04T10:05:37.6726494Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6726534Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6726589Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6726684Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6727033Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6727084Z graph_break [] 2025-12-04T10:05:37.6727157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6727197Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6727252Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6727346Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6727700Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6727736Z graph_break [] 2025-12-04T10:05:37.6727807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6727847Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6727902Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6727998Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6728335Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6728371Z graph_break [] 2025-12-04T10:05:37.6728460Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6728505Z Traceback (most recent call last): 2025-12-04T10:05:37.6728658Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6728728Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6728868Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6728942Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6729100Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6729186Z raise 
error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6729234Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6729236Z 2025-12-04T10:05:37.6729276Z Expected 0 but got 1. 2025-12-04T10:05:37.6729315Z Absolute difference: 1 2025-12-04T10:05:37.6729355Z Relative difference: inf 2025-12-04T10:05:37.6729356Z 2025-12-04T10:05:37.6729430Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6729589Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6729591Z 2025-12-04T10:05:37.6729678Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6729752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6729792Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6729848Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6730158Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6730256Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6730291Z graph_break [] 2025-12-04T10:05:37.6730364Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6730403Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6730458Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6730552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6730875Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6730912Z graph_break [] 2025-12-04T10:05:37.6730985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6731025Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6731082Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6731187Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6731497Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6731536Z graph_break [] 2025-12-04T10:05:37.6731610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6731651Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6731705Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6731800Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6732110Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6732146Z graph_break [] 2025-12-04T10:05:37.6732217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6732257Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6732310Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6732418Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6732722Z inductor 
[('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6732769Z graph_break [] 2025-12-04T10:05:37.6732840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6732883Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6732937Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6733031Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6733372Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6733409Z graph_break [] 2025-12-04T10:05:37.6733481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6733523Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6733575Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6733670Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6734013Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6734048Z graph_break [] 2025-12-04T10:05:37.6734120Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6734172Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6734225Z stats [('calls_captured', 
2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6734322Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6734661Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6734696Z graph_break [] 2025-12-04T10:05:37.6734778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6734818Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6734872Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6734966Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6735306Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6735341Z graph_break [] 2025-12-04T10:05:37.6735413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6735453Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6735506Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6735602Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6736089Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6736128Z graph_break [] 
2025-12-04T10:05:37.6736216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6736270Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6736324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6736417Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6736759Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6736793Z graph_break [] 2025-12-04T10:05:37.6736866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6736905Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6736958Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6737054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6737394Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6737429Z graph_break [] 2025-12-04T10:05:37.6737500Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6737541Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6737593Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6737687Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6738034Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6738093Z graph_break [] 2025-12-04T10:05:37.6738163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6738204Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6738256Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6738350Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6738703Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6738739Z graph_break [] 2025-12-04T10:05:37.6738811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6738855Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6738909Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6739007Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6739344Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6739381Z graph_break [] 2025-12-04T10:05:37.6739451Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6739492Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6739544Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6739638Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6739997Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6740045Z graph_break [] 2025-12-04T10:05:37.6740118Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6740157Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6740213Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6740307Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6740646Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6740683Z graph_break [] 2025-12-04T10:05:37.6740755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6740796Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6740850Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6740944Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6741284Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6741319Z graph_break [] 2025-12-04T10:05:37.6741390Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6741430Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6741497Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6741591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6741935Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6741970Z graph_break [] 2025-12-04T10:05:37.6742053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6742092Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6742147Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6742242Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6742585Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6742624Z graph_break [] 2025-12-04T10:05:37.6742694Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6742736Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6742788Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6742885Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6743223Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6743261Z graph_break [] 2025-12-04T10:05:37.6743343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6743385Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6743450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6743545Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6743884Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6743920Z graph_break [] 2025-12-04T10:05:37.6743990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6744031Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6744083Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6744183Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6744526Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6744562Z graph_break [] 2025-12-04T10:05:37.6744633Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6744673Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6744728Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6744822Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6745162Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6745208Z graph_break [] 2025-12-04T10:05:37.6745280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6745321Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6745373Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6745468Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6745817Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6745852Z graph_break [] 2025-12-04T10:05:37.6746183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6746225Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6746279Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6746374Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6746715Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6746750Z graph_break [] 2025-12-04T10:05:37.6746822Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6746861Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6746914Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6747008Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6747365Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6747414Z graph_break [] 2025-12-04T10:05:37.6747486Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6747525Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6747578Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6747676Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6748016Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6748053Z graph_break [] 2025-12-04T10:05:37.6748125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6748165Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6748219Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6748314Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6748656Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6748691Z graph_break [] 2025-12-04T10:05:37.6748762Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6748802Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6748868Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6748964Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6749303Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6749339Z graph_break [] 2025-12-04T10:05:37.6749409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6749462Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6749516Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6749611Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6749950Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6749988Z graph_break [] 2025-12-04T10:05:37.6750058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6750099Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6750152Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6750248Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6750586Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6750622Z graph_break [] 2025-12-04T10:05:37.6750695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6750747Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6750800Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6750907Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6751249Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6751286Z graph_break [] 2025-12-04T10:05:37.6751358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6751398Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6751452Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6751547Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6751888Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6751923Z graph_break [] 2025-12-04T10:05:37.6751995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6752034Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6752090Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6752185Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6752524Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6752570Z graph_break [] 2025-12-04T10:05:37.6752642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6752682Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6752736Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6752830Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6753190Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6753225Z graph_break [] 2025-12-04T10:05:37.6753318Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6753363Z Traceback (most recent call last): 2025-12-04T10:05:37.6753518Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6753589Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6753728Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6753789Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6753946Z File 
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6754018Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6754066Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6754068Z 2025-12-04T10:05:37.6754107Z Expected 0 but got 1. 2025-12-04T10:05:37.6754144Z Absolute difference: 1 2025-12-04T10:05:37.6754184Z Relative difference: inf 2025-12-04T10:05:37.6754187Z 2025-12-04T10:05:37.6754258Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6754426Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6754443Z 2025-12-04T10:05:37.6754529Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6754602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6754642Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6754697Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6755009Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6755106Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6755143Z graph_break [] 2025-12-04T10:05:37.6755215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6755256Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6755312Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6755406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6755719Z inductor 
[('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6755755Z graph_break [] 2025-12-04T10:05:37.6755829Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6755869Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6755970Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6756081Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6756391Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6756427Z graph_break [] 2025-12-04T10:05:37.6756499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6756538Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6756613Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6756710Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6757020Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6757056Z graph_break [] 2025-12-04T10:05:37.6757127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6757168Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6757222Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6757319Z 
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6757628Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6760010Z graph_break [] 2025-12-04T10:05:37.6760083Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6760125Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6760178Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6760299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6760657Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6760693Z graph_break [] 2025-12-04T10:05:37.6760767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6760807Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6760861Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6760982Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6761325Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6761362Z graph_break [] 2025-12-04T10:05:37.6761435Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6761475Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6761528Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6761622Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6761962Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6762000Z graph_break [] 2025-12-04T10:05:37.6762072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6762114Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6763098Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6763194Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6763546Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6763582Z graph_break [] 2025-12-04T10:05:37.6763654Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6763697Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6763749Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6763844Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6764185Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6764225Z graph_break [] 2025-12-04T10:05:37.6764297Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6764335Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6764390Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6764484Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6764909Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6764945Z graph_break [] 2025-12-04T10:05:37.6765019Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6765059Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6765115Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6765209Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6765558Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6765595Z graph_break [] 2025-12-04T10:05:37.6765667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6765707Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6765763Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6765857Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6766246Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6766281Z graph_break [] 2025-12-04T10:05:37.6766355Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6766395Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6766450Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6766548Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6766888Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6766944Z graph_break [] 2025-12-04T10:05:37.6767016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6767054Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6767109Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6767218Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6767557Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6767594Z graph_break [] 2025-12-04T10:05:37.6767667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6767706Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6767761Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6767856Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6768197Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6768233Z graph_break [] 2025-12-04T10:05:37.6768304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6768367Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6768419Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6768535Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6768874Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6768911Z graph_break [] 2025-12-04T10:05:37.6768982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6769023Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6769075Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6769173Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6769517Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6769554Z graph_break [] 2025-12-04T10:05:37.6769625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6769666Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6769718Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6769814Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6770154Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6770191Z graph_break [] 2025-12-04T10:05:37.6770261Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6770303Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6770356Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6770464Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6770804Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6770853Z graph_break [] 2025-12-04T10:05:37.6770925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6770965Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6771020Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6771115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6771457Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6771493Z graph_break [] 2025-12-04T10:05:37.6771566Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6771605Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6771661Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6771755Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6772095Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6772152Z graph_break [] 2025-12-04T10:05:37.6772225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6772266Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6772320Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6772413Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6772753Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6772789Z graph_break [] 2025-12-04T10:05:37.6772864Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6772904Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6772958Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6773053Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6773398Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6773434Z graph_break [] 2025-12-04T10:05:37.6773506Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6773546Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6773599Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6773695Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6774033Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6774081Z graph_break [] 2025-12-04T10:05:37.6774153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6774193Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6774245Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6774352Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6774688Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6774726Z graph_break [] 2025-12-04T10:05:37.6774799Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6774840Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6774893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6774988Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6775331Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6775367Z graph_break [] 2025-12-04T10:05:37.6775438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6775493Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6775548Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6775657Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6776037Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6776075Z graph_break [] 2025-12-04T10:05:37.6776147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6776186Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6776240Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6776335Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6776676Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6776711Z graph_break [] 2025-12-04T10:05:37.6776784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6776823Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6776878Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6776972Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6777312Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6777348Z graph_break [] 2025-12-04T10:05:37.6777422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6777461Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6777517Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6777612Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6777971Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6778006Z graph_break [] 2025-12-04T10:05:37.6778096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6778138Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6778193Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6778289Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6778634Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6778670Z graph_break [] 2025-12-04T10:05:37.6778741Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6778782Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6778835Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6778931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6779268Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6779322Z graph_break [] 2025-12-04T10:05:37.6779408Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6779451Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6779504Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6779601Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6779942Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6779978Z graph_break [] 
2025-12-04T10:05:37.6780049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6780090Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6780142Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6780241Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6780578Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6780615Z graph_break [] 2025-12-04T10:05:37.6780686Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6780729Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6780781Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6780877Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6781216Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6781264Z graph_break [] 2025-12-04T10:05:37.6781336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6781375Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6781429Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6781523Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6781877Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6781914Z graph_break []
2025-12-04T10:05:37.6782003Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.6782050Z Traceback (most recent call last):
2025-12-04T10:05:37.6782203Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.6782276Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.6782416Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.6782475Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.6782637Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.6782707Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.6782776Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.6782779Z
2025-12-04T10:05:37.6782818Z Expected 0 but got 1.
2025-12-04T10:05:37.6782858Z Absolute difference: 1
2025-12-04T10:05:37.6782897Z Relative difference: inf
2025-12-04T10:05:37.6782909Z
2025-12-04T10:05:37.6782983Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.6783142Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.6783145Z
2025-12-04T10:05:37.6783234Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.6783306Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6783348Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6783401Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6783713Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6783810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6783846Z graph_break []
2025-12-04T10:05:37.6783918Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6783959Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6784011Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6784107Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6784417Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6784454Z graph_break []
2025-12-04T10:05:37.6784527Z ----------------------------- Captured
stdout call ----------------------------- 2025-12-04T10:05:37.6784566Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6784621Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6784715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6785037Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6785072Z graph_break [] 2025-12-04T10:05:37.6785155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6785195Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6785248Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6785343Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6785656Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6785691Z graph_break [] 2025-12-04T10:05:37.6785764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6785804Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6785857Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6785988Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6786297Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6786348Z graph_break [] 2025-12-04T10:05:37.6786420Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6786475Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6786529Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6786625Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6786970Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6787007Z graph_break [] 2025-12-04T10:05:37.6787078Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6787117Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6787172Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6787268Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6787611Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6787648Z graph_break [] 2025-12-04T10:05:37.6787719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6787759Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6787812Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6787908Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6788247Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6788285Z graph_break [] 2025-12-04T10:05:37.6788355Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6788414Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6788467Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6788566Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6788919Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6788955Z graph_break [] 2025-12-04T10:05:37.6789027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6789069Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6789121Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6789220Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6789560Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6789596Z graph_break [] 2025-12-04T10:05:37.6789668Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6789708Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6789761Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6789869Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6790219Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6790255Z graph_break [] 2025-12-04T10:05:37.6790327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6790366Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6790420Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6790514Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6790855Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6790891Z graph_break [] 2025-12-04T10:05:37.6790966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6791006Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6791060Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6791154Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6791495Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6791530Z graph_break [] 2025-12-04T10:05:37.6791601Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6791643Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6791698Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6791792Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6792136Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6792186Z graph_break [] 2025-12-04T10:05:37.6792257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6792296Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6792364Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6792459Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6792803Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6792839Z graph_break [] 2025-12-04T10:05:37.6792911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6792953Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6793007Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6793102Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6793441Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6793490Z graph_break [] 2025-12-04T10:05:37.6793561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6793601Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6793664Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6793761Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6794100Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6794139Z graph_break [] 2025-12-04T10:05:37.6794212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6794252Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6794304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6794404Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6794746Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6794783Z graph_break [] 2025-12-04T10:05:37.6794855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6794896Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6794950Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6795047Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6795389Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6795428Z graph_break [] 2025-12-04T10:05:37.6795499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6795538Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6795603Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6795697Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6796094Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6796129Z graph_break [] 2025-12-04T10:05:37.6796200Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6796240Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6796295Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6796390Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6796733Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6796772Z graph_break [] 
2025-12-04T10:05:37.6796843Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6796884Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6796940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6797034Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6797409Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6797446Z graph_break [] 2025-12-04T10:05:37.6797519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6797558Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6797612Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6797708Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6798046Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6798088Z graph_break [] 2025-12-04T10:05:37.6798160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6798204Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6798257Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6798354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6798693Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6798729Z graph_break [] 2025-12-04T10:05:37.6798801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6798843Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6798898Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6798994Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6799336Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6799387Z graph_break [] 2025-12-04T10:05:37.6799458Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6799499Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6799563Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6799661Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6799999Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6800040Z graph_break [] 2025-12-04T10:05:37.6800111Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6800152Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6800206Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6800303Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6800644Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6800694Z graph_break [] 2025-12-04T10:05:37.6800768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6800807Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6800862Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6800968Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6801310Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6801345Z graph_break [] 2025-12-04T10:05:37.6801418Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6804368Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6804433Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6804532Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6804883Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6804919Z graph_break [] 2025-12-04T10:05:37.6804992Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6805032Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6805086Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6805182Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6805521Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6805556Z graph_break [] 2025-12-04T10:05:37.6805630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6805669Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6805750Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6805845Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6806268Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6806303Z graph_break [] 2025-12-04T10:05:37.6806376Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6806416Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6806471Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6806566Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6806904Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6806942Z graph_break [] 2025-12-04T10:05:37.6807013Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6807053Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6807106Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6807201Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6807567Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6807603Z graph_break [] 2025-12-04T10:05:37.6807675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6807715Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6807767Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6807862Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6808203Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6808240Z graph_break [] 2025-12-04T10:05:37.6808311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6808353Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6808407Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6808502Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6808842Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6808878Z graph_break [] 2025-12-04T10:05:37.6808949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6808990Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6809043Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6809139Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6809479Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6809528Z graph_break [] 2025-12-04T10:05:37.6809599Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6809638Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6809693Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6809799Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6810137Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6810173Z graph_break [] 2025-12-04T10:05:37.6810245Z ----------------------------- Captured stdout call 
----------------------------- 
2025-12-04T10:05:37.6810284Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6810339Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6810434Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6810777Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6810812Z graph_break []
2025-12-04T10:05:37.6810902Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.6810961Z Traceback (most recent call last):
2025-12-04T10:05:37.6811140Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.6811212Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.6811357Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.6811416Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.6811577Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.6811650Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.6811700Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.6811703Z
2025-12-04T10:05:37.6811742Z Expected 0 but got 1.
2025-12-04T10:05:37.6811783Z Absolute difference: 1
2025-12-04T10:05:37.6811824Z Relative difference: inf
2025-12-04T10:05:37.6811826Z
2025-12-04T10:05:37.6811900Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.6812060Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.6812064Z
2025-12-04T10:05:37.6812151Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.6812225Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6812265Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6812320Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6812632Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6812731Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6812766Z graph_break []
2025-12-04T10:05:37.6812838Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6812877Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6812931Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6813040Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6813367Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6813403Z graph_break []
2025-12-04T10:05:37.6813474Z ----------------------------- Captured
stdout call ----------------------------- 2025-12-04T10:05:37.6813514Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6813567Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6813661Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6813971Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6814007Z graph_break [] 2025-12-04T10:05:37.6814079Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6814119Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6814175Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6814270Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6814579Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6814626Z graph_break [] 2025-12-04T10:05:37.6814709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6814749Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6814803Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6814897Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6815206Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6815242Z graph_break [] 2025-12-04T10:05:37.6815312Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6815353Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6815405Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6815500Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6815839Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6815876Z graph_break [] 2025-12-04T10:05:37.6815980Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6816020Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6816073Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6816168Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6816509Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6816545Z graph_break [] 2025-12-04T10:05:37.6816632Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6816672Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6816724Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6816819Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6817179Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6817217Z graph_break [] 2025-12-04T10:05:37.6817289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6817329Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6817382Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6817480Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6817820Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6817858Z graph_break [] 2025-12-04T10:05:37.6817929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6817968Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6818035Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6818129Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6818480Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6818516Z graph_break [] 2025-12-04T10:05:37.6818587Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6818626Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6818679Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6818773Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6819112Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6819149Z graph_break [] 2025-12-04T10:05:37.6819220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6819260Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6819314Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6819408Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6819751Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6819786Z graph_break [] 2025-12-04T10:05:37.6819859Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6819898Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6819951Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6820046Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6820396Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6820432Z graph_break [] 2025-12-04T10:05:37.6820514Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6820554Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6820607Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6820703Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6821041Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6821077Z graph_break [] 2025-12-04T10:05:37.6821147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6821188Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6821240Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6821336Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6821675Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6821722Z graph_break [] 2025-12-04T10:05:37.6821803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6821843Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6821897Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6821993Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6822331Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6822367Z graph_break [] 2025-12-04T10:05:37.6822438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6822479Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6822532Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6822626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6822966Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6823003Z graph_break [] 2025-12-04T10:05:37.6823074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6823112Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6823167Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6823260Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6823603Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6823638Z graph_break [] 2025-12-04T10:05:37.6823709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6823761Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6823814Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6823908Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6824259Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6824295Z graph_break [] 2025-12-04T10:05:37.6824366Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6824405Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6824459Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6824553Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6824892Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6824927Z graph_break [] 2025-12-04T10:05:37.6825000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6825039Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6825091Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6825209Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6825557Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6825595Z graph_break [] 
2025-12-04T10:05:37.6825667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6825707Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6825759Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6825855Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6826251Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6826288Z graph_break [] 2025-12-04T10:05:37.6826360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6826400Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6826453Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6826548Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6826885Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6826921Z graph_break [] 2025-12-04T10:05:37.6826991Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6827032Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6827085Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6827180Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6827518Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6827577Z graph_break [] 2025-12-04T10:05:37.6827648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6827703Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6827756Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6827852Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6828191Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6828228Z graph_break [] 2025-12-04T10:05:37.6828298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6828339Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6828391Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6828486Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6828825Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6828874Z graph_break [] 2025-12-04T10:05:37.6828946Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6828999Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6829053Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6829147Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6829485Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6829521Z graph_break [] 2025-12-04T10:05:37.6829593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6829632Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6829687Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6829781Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6830123Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6830159Z graph_break [] 2025-12-04T10:05:37.6830230Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6830269Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6830323Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6830417Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6830757Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6830793Z graph_break [] 2025-12-04T10:05:37.6830864Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6830914Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6830967Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6831061Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6831411Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6831447Z graph_break [] 2025-12-04T10:05:37.6831519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6831560Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6831612Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6831708Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6832053Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6832089Z graph_break [] 2025-12-04T10:05:37.6832160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6832200Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6832252Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6832359Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6832705Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6832742Z graph_break [] 2025-12-04T10:05:37.6832813Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6832853Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6832905Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6833001Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6833338Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6833375Z graph_break [] 2025-12-04T10:05:37.6833446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6833487Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6833539Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6833634Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6833978Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6834014Z graph_break [] 2025-12-04T10:05:37.6834085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6834125Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6834178Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6834273Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6834617Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6834663Z graph_break [] 2025-12-04T10:05:37.6834734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6834773Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6834837Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6834931Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6835271Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6835305Z graph_break [] 2025-12-04T10:05:37.6835378Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6835420Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6835474Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6835567Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6835905Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6835997Z graph_break [] 2025-12-04T10:05:37.6836070Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6836109Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6836179Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6836275Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6836617Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6836653Z graph_break [] 2025-12-04T10:05:37.6836724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6836765Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6836817Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6836913Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6837251Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6837288Z graph_break [] 2025-12-04T10:05:37.6837376Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6837422Z Traceback (most recent call last): 2025-12-04T10:05:37.6837574Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6837645Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6837783Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6837844Z return 
super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6838002Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6838074Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6838141Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6838144Z 2025-12-04T10:05:37.6838183Z Expected 0 but got 1. 2025-12-04T10:05:37.6838220Z Absolute difference: 1 2025-12-04T10:05:37.6838262Z Relative difference: inf 2025-12-04T10:05:37.6838264Z 2025-12-04T10:05:37.6838335Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6838517Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6838519Z 2025-12-04T10:05:37.6838606Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6838680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6838720Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6838775Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6839088Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6839186Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6839221Z graph_break [] 2025-12-04T10:05:37.6839292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6839333Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6839385Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6839497Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), 
('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6839815Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6839854Z graph_break [] 2025-12-04T10:05:37.6839924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6839965Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6840019Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6840113Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6840424Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6840462Z graph_break [] 2025-12-04T10:05:37.6840532Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6840573Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6840625Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6840721Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6841032Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6841068Z graph_break [] 2025-12-04T10:05:37.6841140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6841180Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6841232Z stats 
[('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6841327Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6841635Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6841683Z graph_break [] 2025-12-04T10:05:37.6841753Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6841795Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6841847Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6841953Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6842293Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6842329Z graph_break [] 2025-12-04T10:05:37.6842401Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6842440Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6842495Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6842588Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6842931Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6842966Z graph_break [] 
2025-12-04T10:05:37.6843037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6843087Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6843140Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6843244Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6843584Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6843620Z graph_break [] 2025-12-04T10:05:37.6843692Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6843732Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6843785Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6843878Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6844219Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6844254Z graph_break [] 2025-12-04T10:05:37.6844327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6844369Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6844423Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6844517Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6844858Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6844895Z graph_break [] 2025-12-04T10:05:37.6844966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6845005Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6845058Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6845167Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6845506Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6845552Z graph_break [] 2025-12-04T10:05:37.6845622Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6845662Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6845715Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6845810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6846203Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6846241Z graph_break [] 2025-12-04T10:05:37.6846311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6846352Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6846404Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6846500Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6846837Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6846892Z graph_break [] 2025-12-04T10:05:37.6846976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6847018Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6847071Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6847166Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6847507Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6847543Z graph_break [] 2025-12-04T10:05:37.6847615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6847655Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6847710Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6847804Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6848145Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6848180Z graph_break [] 2025-12-04T10:05:37.6848252Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6848291Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6848344Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6848437Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6848777Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6848828Z graph_break [] 2025-12-04T10:05:37.6848900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6848939Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6848992Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6849086Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6849443Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6849479Z graph_break [] 2025-12-04T10:05:37.6849551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6849591Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6849646Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6849740Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6850081Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6850118Z graph_break [] 2025-12-04T10:05:37.6850188Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6850241Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6850293Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6850387Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6850739Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6850776Z graph_break [] 2025-12-04T10:05:37.6850847Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6850887Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6850940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6851034Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6851377Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6851413Z graph_break [] 2025-12-04T10:05:37.6851484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6851525Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6851577Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6851671Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6852008Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6852045Z graph_break [] 2025-12-04T10:05:37.6852115Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6852156Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6852209Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6852305Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6852661Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6852698Z graph_break [] 2025-12-04T10:05:37.6852780Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6852821Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6852873Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6852968Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6853308Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6853343Z graph_break [] 2025-12-04T10:05:37.6853414Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6853454Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6853507Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6853601Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6853941Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6853988Z graph_break [] 2025-12-04T10:05:37.6854070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6854110Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6854165Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6854258Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6854597Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6854632Z graph_break [] 2025-12-04T10:05:37.6854703Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6854743Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6854798Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6854892Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6855231Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6855267Z graph_break [] 2025-12-04T10:05:37.6855338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6855379Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6855432Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6855526Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6855868Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6855904Z graph_break [] 2025-12-04T10:05:37.6856029Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6856069Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6856122Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6856217Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6856569Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6856606Z graph_break [] 2025-12-04T10:05:37.6856677Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6856718Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6856770Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6856866Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6857204Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6857240Z graph_break [] 2025-12-04T10:05:37.6857310Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6857350Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6857416Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6857510Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6857864Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6857901Z graph_break [] 2025-12-04T10:05:37.6857972Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6858011Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6858063Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6858160Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6858498Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6858535Z graph_break [] 2025-12-04T10:05:37.6858606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6858646Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6858699Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6858792Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6859131Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6859166Z graph_break [] 2025-12-04T10:05:37.6859238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6859277Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6859331Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6859425Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6859778Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6859813Z graph_break [] 2025-12-04T10:05:37.6859896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6859936Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6859989Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6860083Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6860421Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6860457Z graph_break [] 2025-12-04T10:05:37.6860528Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6860567Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6860621Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6860715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6861055Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6861102Z graph_break [] 2025-12-04T10:05:37.6861183Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6861224Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6861276Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6861371Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6861714Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6861749Z graph_break [] 2025-12-04T10:05:37.6861820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6861863Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6861916Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6862010Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6862348Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6862385Z graph_break [] 2025-12-04T10:05:37.6862455Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6862495Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6862547Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6862642Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6862980Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6863016Z graph_break [] 2025-12-04T10:05:37.6863087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6863138Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6863190Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6863285Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6863635Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6863674Z graph_break [] 2025-12-04T10:05:37.6863747Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6863786Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6863840Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6863935Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6864274Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6864308Z graph_break [] 2025-12-04T10:05:37.6864398Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6864442Z Traceback (most recent call last): 2025-12-04T10:05:37.6864596Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6864677Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6864833Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6864892Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6865053Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6865122Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6865171Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6865173Z 2025-12-04T10:05:37.6865211Z Expected 0 but got 1. 
2025-12-04T10:05:37.6865251Z Absolute difference: 1 2025-12-04T10:05:37.6865291Z Relative difference: inf 2025-12-04T10:05:37.6865293Z 2025-12-04T10:05:37.6865364Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6865521Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6865523Z 2025-12-04T10:05:37.6865610Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6865682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6865724Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6865777Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6866131Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6866226Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6866263Z graph_break [] 2025-12-04T10:05:37.6866334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6866375Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6866427Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6866524Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6866848Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6866883Z graph_break [] 2025-12-04T10:05:37.6866955Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.6867010Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6867064Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6867158Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6867471Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6867506Z graph_break [] 2025-12-04T10:05:37.6867578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6867616Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6867670Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6867764Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6868073Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6868122Z graph_break [] 2025-12-04T10:05:37.6868194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6868233Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6868301Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6868395Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6868707Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6868741Z graph_break [] 2025-12-04T10:05:37.6868814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6868853Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6868907Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6869001Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6869343Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6869380Z graph_break [] 2025-12-04T10:05:37.6869450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6869490Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6869542Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6869640Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6869980Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6870016Z graph_break [] 2025-12-04T10:05:37.6870088Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6870129Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6870193Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6870288Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6870638Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6870674Z graph_break [] 2025-12-04T10:05:37.6870745Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6870786Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6870838Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6870934Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6871274Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6871311Z graph_break [] 2025-12-04T10:05:37.6871381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6871422Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6871474Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6871570Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6871927Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6871964Z graph_break [] 2025-12-04T10:05:37.6872036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6872075Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6872129Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6872222Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6872562Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6872598Z graph_break [] 2025-12-04T10:05:37.6872670Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6872710Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6872764Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6872858Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6873197Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6873233Z graph_break [] 2025-12-04T10:05:37.6873304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6873344Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6873398Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6873491Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6873830Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6873876Z graph_break [] 2025-12-04T10:05:37.6873948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6873989Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6874042Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6874146Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6874485Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6874521Z graph_break [] 2025-12-04T10:05:37.6874594Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6874634Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6874687Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6874781Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6875125Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6875161Z graph_break [] 2025-12-04T10:05:37.6875244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6875285Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6875337Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6875444Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6875783Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6875822Z graph_break [] 2025-12-04T10:05:37.6875893Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6875981Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6876034Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6876129Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6876468Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6876505Z graph_break [] 2025-12-04T10:05:37.6876576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6876616Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6876668Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6876763Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6877104Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6877141Z graph_break [] 2025-12-04T10:05:37.6877212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6877253Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6877306Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6877419Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6877775Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6877810Z graph_break [] 2025-12-04T10:05:37.6877882Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6877922Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6877976Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6878069Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6878408Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6878443Z graph_break [] 2025-12-04T10:05:37.6878514Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6878554Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6878610Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6878704Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6879078Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6879114Z graph_break [] 
2025-12-04T10:05:37.6879186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6879228Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6879281Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6879374Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6879712Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6879749Z graph_break [] 2025-12-04T10:05:37.6879820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6879859Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6879914Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6880007Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6880347Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6880383Z graph_break [] 2025-12-04T10:05:37.6880454Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6880494Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6880546Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6880642Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6880980Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6881030Z graph_break [] 2025-12-04T10:05:37.6881101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6881141Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6881193Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6881299Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6881636Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6881672Z graph_break [] 2025-12-04T10:05:37.6881744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6881784Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6881838Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6881933Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6882272Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6882309Z graph_break [] 2025-12-04T10:05:37.6882381Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6882434Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6882487Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6882592Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6882931Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6882968Z graph_break [] 2025-12-04T10:05:37.6883040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6883080Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6883133Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6883227Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6883568Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6883606Z graph_break [] 2025-12-04T10:05:37.6883679Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6883718Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6883771Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6883864Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6884205Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6884241Z graph_break [] 2025-12-04T10:05:37.6884312Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6884352Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6884405Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6884511Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6884848Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6884894Z graph_break [] 2025-12-04T10:05:37.6884966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6885005Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6885059Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6885153Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6885494Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6885531Z graph_break [] 2025-12-04T10:05:37.6885602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6885642Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6885694Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6885790Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6886166Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6886229Z graph_break [] 2025-12-04T10:05:37.6886300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6886342Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6886395Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6886489Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6886829Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6886866Z graph_break [] 2025-12-04T10:05:37.6886938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6886978Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6887030Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6887125Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6887466Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6887502Z graph_break [] 2025-12-04T10:05:37.6887574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6887614Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6887666Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6887760Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6888097Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6888147Z graph_break [] 2025-12-04T10:05:37.6888219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6888259Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6888312Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6888406Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6888763Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6888801Z graph_break [] 2025-12-04T10:05:37.6888873Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6888914Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6888967Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6889062Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6889402Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6889437Z graph_break [] 2025-12-04T10:05:37.6889508Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6889559Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6889613Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6889707Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6890056Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6890092Z graph_break [] 2025-12-04T10:05:37.6890165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6890203Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6890258Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6890354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6890694Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6890729Z graph_break [] 2025-12-04T10:05:37.6890800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6890842Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6890897Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6890991Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6891332Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6891370Z graph_break []
2025-12-04T10:05:37.6891441Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6891481Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6891534Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6891632Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6891994Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6892030Z graph_break []
2025-12-04T10:05:37.6892133Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.6892180Z Traceback (most recent call last):
2025-12-04T10:05:37.6892331Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.6892405Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.6892544Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.6892603Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.6892760Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.6892832Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.6892879Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.6892882Z
2025-12-04T10:05:37.6892920Z Expected 0 but got 1.
2025-12-04T10:05:37.6892961Z Absolute difference: 1
2025-12-04T10:05:37.6893001Z Relative difference: inf
2025-12-04T10:05:37.6893003Z
2025-12-04T10:05:37.6893074Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.6893244Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.6893246Z
2025-12-04T10:05:37.6893344Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.6893418Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6893459Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6893514Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6893825Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6893921Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6893957Z graph_break []
2025-12-04T10:05:37.6894029Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.6894069Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.6894122Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.6894219Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.6894528Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.6894564Z graph_break []
2025-12-04T10:05:37.6894634Z ----------------------------- Captured
stdout call ----------------------------- 2025-12-04T10:05:37.6894676Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6894728Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6894823Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6895133Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6895170Z graph_break [] 2025-12-04T10:05:37.6895251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6895291Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6895344Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6895439Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6895762Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6895799Z graph_break [] 2025-12-04T10:05:37.6895870Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6895910Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6895987Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6896083Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6896396Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6896432Z graph_break [] 2025-12-04T10:05:37.6896503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6896545Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6896598Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6896709Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6897063Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6897099Z graph_break [] 2025-12-04T10:05:37.6897172Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6897211Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6897263Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6897358Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6897698Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6897733Z graph_break [] 2025-12-04T10:05:37.6897805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6897844Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6897898Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6897991Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6898334Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6898371Z graph_break [] 2025-12-04T10:05:37.6898443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6898483Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6898536Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6898631Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6898971Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6899020Z graph_break [] 2025-12-04T10:05:37.6899091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6899145Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6899200Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6899294Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6899637Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6899673Z graph_break [] 2025-12-04T10:05:37.6899746Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6899786Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6899838Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6899933Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6900274Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6900322Z graph_break [] 2025-12-04T10:05:37.6900393Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6900433Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6900496Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6900592Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6900931Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6900967Z graph_break [] 2025-12-04T10:05:37.6901038Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6901078Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6901131Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6901232Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6901571Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6901607Z graph_break [] 2025-12-04T10:05:37.6901678Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6901719Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6901772Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6901868Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6902209Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6902245Z graph_break [] 2025-12-04T10:05:37.6902317Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6902367Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6902422Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6902515Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6902869Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6902905Z graph_break [] 2025-12-04T10:05:37.6902978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6903016Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6903070Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6903164Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6903503Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6903542Z graph_break [] 2025-12-04T10:05:37.6903615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6903654Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6903708Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6903801Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6904165Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6904202Z graph_break [] 2025-12-04T10:05:37.6904274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6904313Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6904367Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6904461Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6904802Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6904839Z graph_break [] 2025-12-04T10:05:37.6904910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6904950Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6905002Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6905098Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6905435Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6905471Z graph_break [] 2025-12-04T10:05:37.6905541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6905582Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6905634Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6905729Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6906111Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6906167Z graph_break [] 2025-12-04T10:05:37.6906237Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6906277Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6906343Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6906438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6906778Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6906814Z graph_break [] 
2025-12-04T10:05:37.6906887Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6906931Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6906984Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6907079Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6907418Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6907473Z graph_break [] 2025-12-04T10:05:37.6907545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6907584Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6907653Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6907748Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6908094Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6908129Z graph_break [] 2025-12-04T10:05:37.6908201Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6908240Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6908294Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6908389Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6908728Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6908764Z graph_break [] 2025-12-04T10:05:37.6908835Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6908874Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6908927Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6909022Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6909362Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6909398Z graph_break [] 2025-12-04T10:05:37.6909470Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6909509Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6909577Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6909670Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6910018Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6910054Z graph_break [] 2025-12-04T10:05:37.6910126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6910168Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6910222Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6910321Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6910660Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6910697Z graph_break [] 2025-12-04T10:05:37.6910768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6910810Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6910862Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6910957Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6911316Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6911353Z graph_break [] 2025-12-04T10:05:37.6911427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6911468Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6911520Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6911616Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6911955Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6911993Z graph_break [] 2025-12-04T10:05:37.6912064Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6912105Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6912159Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6912254Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6912592Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6912630Z graph_break [] 2025-12-04T10:05:37.6912701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6912741Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6912795Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6912891Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6913232Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6913279Z graph_break [] 2025-12-04T10:05:37.6913351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6913390Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6913443Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6913550Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6913888Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6913925Z graph_break [] 2025-12-04T10:05:37.6913998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6914038Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6914092Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6914186Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6914529Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6914564Z graph_break [] 2025-12-04T10:05:37.6914646Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6914686Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6914741Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6914847Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6915189Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6915223Z graph_break [] 2025-12-04T10:05:37.6915296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6915336Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6915389Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6915482Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6915823Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6915860Z graph_break [] 2025-12-04T10:05:37.6915971Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6916011Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6916064Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6916158Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6916495Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6916531Z graph_break [] 2025-12-04T10:05:37.6916602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6916643Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6916696Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6916805Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6917160Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6917197Z graph_break [] 2025-12-04T10:05:37.6917268Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6917310Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6917362Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6917457Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6917797Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6917833Z graph_break [] 2025-12-04T10:05:37.6917905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6917945Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6917998Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6918093Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6918464Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6918500Z graph_break [] 2025-12-04T10:05:37.6918574Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6918614Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6918668Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6918762Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6919102Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6919137Z graph_break [] 2025-12-04T10:05:37.6919211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6919249Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6919304Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6919398Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6919740Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6919774Z graph_break [] 2025-12-04T10:05:37.6919847Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6919886Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6919940Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6920035Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6920373Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6920421Z graph_break [] 2025-12-04T10:05:37.6920510Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6920555Z Traceback (most recent call last): 2025-12-04T10:05:37.6920707Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6920790Z 
self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6920929Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6920990Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6921147Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6921219Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6921268Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6921270Z 2025-12-04T10:05:37.6921309Z Expected 0 but got 1. 2025-12-04T10:05:37.6921347Z Absolute difference: 1 2025-12-04T10:05:37.6921387Z Relative difference: inf 2025-12-04T10:05:37.6921389Z 2025-12-04T10:05:37.6921460Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6921619Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6921621Z 2025-12-04T10:05:37.6921707Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6921791Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6921831Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6921888Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6922209Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6922308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6922343Z graph_break [] 2025-12-04T10:05:37.6922417Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6922457Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6922511Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6922605Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6922916Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6922952Z graph_break [] 2025-12-04T10:05:37.6923023Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6923062Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6923116Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6923209Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6923518Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6923554Z graph_break [] 2025-12-04T10:05:37.6923626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6923666Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6923720Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6923815Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6924135Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 
2025-12-04T10:05:37.6924171Z graph_break [] 2025-12-04T10:05:37.6924252Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6924294Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6924346Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6924443Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6924751Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6924789Z graph_break [] 2025-12-04T10:05:37.6924859Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6924900Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6924952Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6925049Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6925394Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6925442Z graph_break [] 2025-12-04T10:05:37.6925523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6925564Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6925618Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6925713Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6926095Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6926132Z graph_break [] 2025-12-04T10:05:37.6926204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6926245Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6926297Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6926394Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6926734Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6926770Z graph_break [] 2025-12-04T10:05:37.6926844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6926883Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6926938Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6927033Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6927374Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6927410Z graph_break [] 2025-12-04T10:05:37.6927497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6927536Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6927593Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6927686Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6928039Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6928075Z graph_break [] 2025-12-04T10:05:37.6928148Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6928189Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6928243Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6928338Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6928677Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6928712Z graph_break [] 2025-12-04T10:05:37.6928785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6928824Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6928893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6928987Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6929339Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6929376Z graph_break [] 2025-12-04T10:05:37.6929447Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6929487Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6929539Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6929635Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6929974Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6930011Z graph_break [] 2025-12-04T10:05:37.6930082Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6930123Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6930176Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6930271Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6930610Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6930650Z graph_break [] 2025-12-04T10:05:37.6930722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6930762Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6930815Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6930911Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6931247Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6931295Z graph_break [] 2025-12-04T10:05:37.6931365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6931417Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6931470Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6931566Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6931904Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6931942Z graph_break [] 2025-12-04T10:05:37.6932014Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6932053Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6932106Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6932200Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6932541Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6932593Z graph_break [] 2025-12-04T10:05:37.6932665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6932716Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6932771Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6932866Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6933205Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6933240Z graph_break [] 2025-12-04T10:05:37.6933312Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6933352Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6933405Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6933499Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6933838Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6933874Z graph_break [] 2025-12-04T10:05:37.6933947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6933987Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6934041Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6934134Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6934476Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6934513Z graph_break [] 2025-12-04T10:05:37.6934585Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6934636Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6934689Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6934784Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6935132Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6935169Z graph_break [] 2025-12-04T10:05:37.6935240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6935280Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6935334Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6935429Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6935768Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6935804Z graph_break [] 2025-12-04T10:05:37.6935876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6935916Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6936011Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6936123Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6936476Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6936514Z graph_break [] 2025-12-04T10:05:37.6936586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6936626Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6936678Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6936774Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6937113Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6937149Z graph_break [] 2025-12-04T10:05:37.6937221Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6937261Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6937314Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6937409Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6937750Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6937785Z graph_break [] 2025-12-04T10:05:37.6937858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6937898Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6937951Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6938046Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6938391Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6938438Z graph_break [] 2025-12-04T10:05:37.6938510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6938562Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6938616Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6938710Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6939052Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6939087Z graph_break [] 2025-12-04T10:05:37.6939160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6939199Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6939253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6939346Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6939686Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6939734Z graph_break [] 2025-12-04T10:05:37.6939806Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6939845Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6939911Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6940005Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6940348Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6940385Z graph_break [] 2025-12-04T10:05:37.6940456Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6940497Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6940550Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6940644Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6940983Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6941020Z graph_break [] 2025-12-04T10:05:37.6941090Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6941130Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6941181Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6941276Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6941613Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6941650Z graph_break [] 2025-12-04T10:05:37.6941721Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6941773Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6941826Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6941922Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6942271Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6942308Z graph_break [] 2025-12-04T10:05:37.6942380Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6942421Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6942474Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6942569Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6942909Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6942944Z graph_break [] 2025-12-04T10:05:37.6943015Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6943054Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6943108Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6943202Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6943563Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6943600Z graph_break [] 2025-12-04T10:05:37.6943672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6943711Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6943764Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6943858Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6944197Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6944233Z graph_break [] 2025-12-04T10:05:37.6944305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6944344Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6944398Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6944494Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6944835Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6944870Z graph_break [] 2025-12-04T10:05:37.6944941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6944981Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6945035Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6945128Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6945468Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6945516Z graph_break [] 2025-12-04T10:05:37.6945587Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6945628Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6945706Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6945802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6946184Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6946221Z graph_break [] 2025-12-04T10:05:37.6946292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6946335Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6946387Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6946482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6946825Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6946875Z graph_break [] 2025-12-04T10:05:37.6946945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6946985Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6947050Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6948989Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6949333Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6949370Z graph_break [] 2025-12-04T10:05:37.6949446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6949487Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6949540Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6949636Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6949977Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6950013Z graph_break [] 2025-12-04T10:05:37.6950084Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6950124Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6950177Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6950273Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6950615Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6950651Z graph_break [] 2025-12-04T10:05:37.6950724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6950763Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6950846Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6950940Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6951293Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6951329Z graph_break [] 2025-12-04T10:05:37.6951418Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6951464Z Traceback (most recent call last): 2025-12-04T10:05:37.6951621Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6951692Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6951833Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6951893Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6952053Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6952124Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6952173Z AssertionError: Scalars are not equal! 
2025-12-04T10:05:37.6952176Z 2025-12-04T10:05:37.6952215Z Expected 0 but got 1. 2025-12-04T10:05:37.6952255Z Absolute difference: 1 2025-12-04T10:05:37.6952307Z Relative difference: inf 2025-12-04T10:05:37.6952309Z 2025-12-04T10:05:37.6952381Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6952550Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6952553Z 2025-12-04T10:05:37.6952643Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6952715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6952755Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6952809Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6953122Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6953220Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6953254Z graph_break [] 2025-12-04T10:05:37.6953326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6953366Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6953421Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6953515Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6953825Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6953862Z 
graph_break [] 2025-12-04T10:05:37.6953934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6953974Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6954029Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6954123Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6954433Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6954478Z graph_break [] 2025-12-04T10:05:37.6954550Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6954589Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6954643Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6954746Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6955055Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6955090Z graph_break [] 2025-12-04T10:05:37.6955164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6955204Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6955260Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6955354Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6955665Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6955701Z graph_break [] 2025-12-04T10:05:37.6955771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6955823Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6955875Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6956018Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6956359Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6956395Z graph_break [] 2025-12-04T10:05:37.6956467Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6956507Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6956561Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6956655Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6956997Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6957033Z graph_break [] 2025-12-04T10:05:37.6957105Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6957145Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6957197Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6957292Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.6957637Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6957673Z graph_break [] 2025-12-04T10:05:37.6957744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6957785Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6957839Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6957934Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6958282Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6958318Z graph_break [] 2025-12-04T10:05:37.6958404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6958445Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6958499Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6958593Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6958933Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6958969Z graph_break [] 2025-12-04T10:05:37.6959040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6959079Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6959131Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6959225Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6959568Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6959617Z graph_break [] 2025-12-04T10:05:37.6959701Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6959742Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6959795Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6959889Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6960228Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6960263Z graph_break [] 2025-12-04T10:05:37.6960336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6960375Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6960428Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6960522Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6960860Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6960896Z graph_break [] 2025-12-04T10:05:37.6960968Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6961008Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6961062Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6961156Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6961498Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6961555Z graph_break [] 2025-12-04T10:05:37.6961626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6961666Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6961719Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6961813Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6962164Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6962201Z graph_break [] 2025-12-04T10:05:37.6962271Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6962312Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6962365Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6962461Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6962802Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6962838Z graph_break [] 2025-12-04T10:05:37.6962908Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6962948Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6963012Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6963107Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6963463Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6963500Z graph_break [] 2025-12-04T10:05:37.6963573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6963613Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6963667Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6963762Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6964098Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6964134Z graph_break [] 2025-12-04T10:05:37.6964206Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6964246Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.6964299Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6964394Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6964734Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6964769Z graph_break [] 2025-12-04T10:05:37.6964841Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6964880Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6964934Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6965028Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6965378Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6965413Z graph_break [] 2025-12-04T10:05:37.6965496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6965535Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6965589Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6965683Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6966071Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.6966106Z graph_break [] 2025-12-04T10:05:37.6966178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6966217Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6966271Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6966365Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6966706Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6966756Z graph_break [] 2025-12-04T10:05:37.6966843Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6966883Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6966938Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6967032Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6967370Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6967406Z graph_break [] 2025-12-04T10:05:37.6967476Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6967517Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6967570Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6967665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6968006Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6968042Z graph_break [] 2025-12-04T10:05:37.6968113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6968152Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6968206Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6968300Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6968639Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6968675Z graph_break [] 2025-12-04T10:05:37.6968761Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6968800Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6968853Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6968947Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6969298Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6969335Z graph_break [] 2025-12-04T10:05:37.6969407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6969446Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6969500Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.6969594Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6969935Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6969970Z graph_break [] 2025-12-04T10:05:37.6970042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6970081Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6970145Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6970238Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6970589Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6970625Z graph_break [] 2025-12-04T10:05:37.6970697Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6970736Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6970789Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6970883Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6971219Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6971254Z graph_break [] 
2025-12-04T10:05:37.6971327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6971366Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6971420Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6971514Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6971853Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6971889Z graph_break [] 2025-12-04T10:05:37.6971961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6972001Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6972053Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6972149Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6972487Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6972534Z graph_break [] 2025-12-04T10:05:37.6972604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6972655Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6972708Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6972803Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6973142Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6973179Z graph_break [] 2025-12-04T10:05:37.6973249Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6973289Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6973342Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6973436Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6973775Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6973821Z graph_break [] 2025-12-04T10:05:37.6973892Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6973952Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6974005Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6974101Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6974440Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6974476Z graph_break [] 2025-12-04T10:05:37.6974547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6974588Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6974640Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6974735Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6975074Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6975110Z graph_break [] 2025-12-04T10:05:37.6975181Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6975220Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6975274Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6975368Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6975707Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6975743Z graph_break [] 2025-12-04T10:05:37.6975815Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6975866Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6975952Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6976047Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6976403Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6976439Z graph_break [] 2025-12-04T10:05:37.6976510Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6976549Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6976604Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6976698Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6977036Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6977070Z graph_break [] 2025-12-04T10:05:37.6977143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6977182Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6977236Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6977343Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6977696Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6977733Z graph_break [] 2025-12-04T10:05:37.6977803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6977843Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6977895Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6977990Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6978330Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6978366Z graph_break [] 2025-12-04T10:05:37.6978437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6978477Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6978530Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6978624Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6978963Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6978998Z graph_break [] 2025-12-04T10:05:37.6979069Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6979110Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6979162Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6979257Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6979595Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6979645Z graph_break [] 2025-12-04T10:05:37.6979717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6979767Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6979820Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6979915Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6980257Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6980292Z graph_break [] 2025-12-04T10:05:37.6980365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6980404Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6980457Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6980550Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6980887Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6980932Z graph_break [] 2025-12-04T10:05:37.6981025Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.6981070Z Traceback (most recent call last): 2025-12-04T10:05:37.6981234Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.6981306Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.6981445Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.6981504Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.6981662Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.6981732Z raise 
error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.6981780Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.6981783Z 2025-12-04T10:05:37.6981822Z Expected 0 but got 1. 2025-12-04T10:05:37.6981861Z Absolute difference: 1 2025-12-04T10:05:37.6981900Z Relative difference: inf 2025-12-04T10:05:37.6981902Z 2025-12-04T10:05:37.6981974Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.6982129Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.6982132Z 2025-12-04T10:05:37.6982219Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.6982290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6982330Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6982384Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6982698Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6982796Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6982832Z graph_break [] 2025-12-04T10:05:37.6982904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6982955Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6983009Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6983103Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6983425Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6983460Z graph_break [] 2025-12-04T10:05:37.6983534Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6983573Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6983626Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6983720Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6984031Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6984066Z graph_break [] 2025-12-04T10:05:37.6984137Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6984177Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6984232Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6984326Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6984655Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6984690Z graph_break [] 2025-12-04T10:05:37.6984764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6984803Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6984857Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6984951Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6985261Z inductor 
[('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6985297Z graph_break [] 2025-12-04T10:05:37.6985368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6985407Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6985461Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6985555Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6985897Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6985995Z graph_break [] 2025-12-04T10:05:37.6986067Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6986107Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6986160Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6986255Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6986596Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6986648Z graph_break [] 2025-12-04T10:05:37.6986719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6986759Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6986812Z stats [('calls_captured', 
2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6986921Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6987258Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6987295Z graph_break [] 2025-12-04T10:05:37.6987367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6987409Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6987463Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6987558Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6987897Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6987933Z graph_break [] 2025-12-04T10:05:37.6988003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6988065Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6988118Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6988228Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6988566Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6988602Z graph_break [] 
2025-12-04T10:05:37.6988674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6988714Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6988767Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6988861Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6989202Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6989238Z graph_break [] 2025-12-04T10:05:37.6989309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6989349Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6989403Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6989496Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6989835Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6989871Z graph_break [] 2025-12-04T10:05:37.6989942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6989982Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6990036Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6990140Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6990491Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6990526Z graph_break [] 2025-12-04T10:05:37.6990598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6990637Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6990692Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6990787Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6991126Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6991164Z graph_break [] 2025-12-04T10:05:37.6991235Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6991274Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6991328Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6991423Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6991760Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6991816Z graph_break [] 2025-12-04T10:05:37.6991887Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6991929Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6991981Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.6992075Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6992413Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6992450Z graph_break [] 2025-12-04T10:05:37.6992522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6992562Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6992614Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6992711Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6993052Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6993087Z graph_break [] 2025-12-04T10:05:37.6993159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6993199Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6993251Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6993347Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6993683Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6993730Z graph_break [] 2025-12-04T10:05:37.6993802Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6993841Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6993894Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6993999Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6994337Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6994374Z graph_break [] 2025-12-04T10:05:37.6994448Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6994487Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6994542Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6994635Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6994974Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6995009Z graph_break [] 2025-12-04T10:05:37.6995080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6995131Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6995185Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6995290Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6995631Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6995667Z graph_break [] 2025-12-04T10:05:37.6995739Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6995777Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6995832Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6995979Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6996317Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6996352Z graph_break [] 2025-12-04T10:05:37.6996426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6996465Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6996519Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6996613Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6996954Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6996991Z graph_break [] 2025-12-04T10:05:37.6997062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6997101Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6997155Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6997249Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6997598Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6997634Z graph_break [] 2025-12-04T10:05:37.6997719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6997759Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6997814Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6997909Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6998247Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6998284Z graph_break [] 2025-12-04T10:05:37.6998355Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6998395Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6998448Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6998543Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6998883Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6998940Z graph_break [] 2025-12-04T10:05:37.6999024Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.6999066Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6999119Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6999213Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.6999550Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.6999585Z graph_break [] 2025-12-04T10:05:37.6999660Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.6999700Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.6999755Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.6999849Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7000187Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7000223Z graph_break [] 2025-12-04T10:05:37.7000295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7000334Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7000389Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7000482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7000823Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7000877Z graph_break [] 2025-12-04T10:05:37.7000949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7000988Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7001042Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7001136Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7001492Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7001528Z graph_break [] 2025-12-04T10:05:37.7001600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7001640Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7001693Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7001788Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7002131Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7002167Z graph_break [] 2025-12-04T10:05:37.7002238Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7002289Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7002341Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7002436Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.7002781Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7002820Z graph_break [] 2025-12-04T10:05:37.7002891Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7002931Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7002984Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7003078Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7003418Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7003455Z graph_break [] 2025-12-04T10:05:37.7003526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7003567Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7003619Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7003713Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7004051Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7004088Z graph_break [] 2025-12-04T10:05:37.7004159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7004200Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7004253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7004348Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7004698Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7004735Z graph_break [] 2025-12-04T10:05:37.7004818Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7004858Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7004911Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7005006Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7005345Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7005380Z graph_break [] 2025-12-04T10:05:37.7005452Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7005491Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7005544Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7005639Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7006015Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7006064Z graph_break [] 2025-12-04T10:05:37.7006150Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7006189Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7006244Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7006337Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7006676Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7006711Z graph_break [] 2025-12-04T10:05:37.7006783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7006822Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7006875Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7006969Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7007308Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7007344Z graph_break [] 2025-12-04T10:05:37.7007414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7007455Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7007507Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7007602Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7007941Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7007976Z graph_break [] 2025-12-04T10:05:37.7008061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7008101Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7008154Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7008252Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7008611Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7008648Z graph_break [] 2025-12-04T10:05:37.7008719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7008759Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7008813Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7008907Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7009247Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7009283Z graph_break [] 2025-12-04T10:05:37.7009355Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7009396Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.7009461Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7009556Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7009902Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7009939Z graph_break [] 2025-12-04T10:05:37.7010011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7010051Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7010104Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7010199Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7010541Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7010576Z graph_break [] 2025-12-04T10:05:37.7010648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7010688Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7010742Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7010835Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7011175Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.7011209Z graph_break [] 2025-12-04T10:05:37.7011299Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.7011343Z Traceback (most recent call last): 2025-12-04T10:05:37.7011496Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.7011566Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.7011717Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.7011774Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.7011935Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.7012015Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.7012064Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7012066Z 2025-12-04T10:05:37.7012104Z Expected 0 but got 1. 
2025-12-04T10:05:37.7012143Z Absolute difference: 1 2025-12-04T10:05:37.7012183Z Relative difference: inf 2025-12-04T10:05:37.7012185Z 2025-12-04T10:05:37.7012256Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7012412Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7012415Z 2025-12-04T10:05:37.7012502Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7012575Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7012614Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7012667Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7012980Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7013087Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7013122Z graph_break [] 2025-12-04T10:05:37.7013204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7013245Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7013300Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7013392Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7013703Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7013738Z graph_break [] 2025-12-04T10:05:37.7013809Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.7013849Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7013902Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7013996Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7014307Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7014342Z graph_break [] 2025-12-04T10:05:37.7014414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7014453Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7014508Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7014602Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7014909Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7014945Z graph_break [] 2025-12-04T10:05:37.7015017Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7015074Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7015128Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7015221Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7015541Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7015577Z graph_break [] 2025-12-04T10:05:37.7015648Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7015690Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7015742Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7015838Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7016213Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7016250Z graph_break [] 2025-12-04T10:05:37.7016320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7016361Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7016415Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7016509Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7016875Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7016912Z graph_break [] 2025-12-04T10:05:37.7016983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7017022Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7017075Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7017169Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7017507Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7017543Z graph_break [] 2025-12-04T10:05:37.7017614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7017654Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7017707Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7017803Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7018140Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7018176Z graph_break [] 2025-12-04T10:05:37.7018249Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7018287Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7018341Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7018436Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7018774Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7018824Z graph_break [] 2025-12-04T10:05:37.7018896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7018936Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.7019003Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7019098Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7019438Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7019474Z graph_break [] 2025-12-04T10:05:37.7019546Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7019585Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7019639Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7019733Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7020073Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7020120Z graph_break [] 2025-12-04T10:05:37.7020194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7020233Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7020286Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7020391Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7020731Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.7020766Z graph_break [] 2025-12-04T10:05:37.7020840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7020879Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7020933Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7021027Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7021368Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7021405Z graph_break [] 2025-12-04T10:05:37.7021476Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7021516Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7021569Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7021665Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7022001Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7022038Z graph_break [] 2025-12-04T10:05:37.7022110Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7022150Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7022213Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7022308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7022658Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7022694Z graph_break [] 2025-12-04T10:05:37.7022766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7022808Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7022861Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7022956Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7023294Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7023331Z graph_break [] 2025-12-04T10:05:37.7023401Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7023442Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7023496Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7023591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7023952Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7023988Z graph_break [] 2025-12-04T10:05:37.7024061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7024100Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7024153Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.7024247Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7024590Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7024626Z graph_break [] 2025-12-04T10:05:37.7024698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7024739Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7024794Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7024887Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7025226Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7025261Z graph_break [] 2025-12-04T10:05:37.7025333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7025372Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7025426Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7025521Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7025860Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7025910Z graph_break [] 
2025-12-04T10:05:37.7026026Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7026065Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7026119Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7026228Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7026566Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7026603Z graph_break [] 2025-12-04T10:05:37.7026675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7026717Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7026769Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7026863Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7027204Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7027240Z graph_break [] 2025-12-04T10:05:37.7027311Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7027365Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7027418Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7027533Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7027870Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7027906Z graph_break [] 2025-12-04T10:05:37.7027978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7028019Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7028071Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7028166Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7028507Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7028544Z graph_break [] 2025-12-04T10:05:37.7028615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7028655Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7028708Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7028802Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7029141Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7029177Z graph_break [] 2025-12-04T10:05:37.7029249Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7029290Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7029343Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.7029450Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7029799Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7029834Z graph_break [] 2025-12-04T10:05:37.7029906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7029944Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7029999Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7030092Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7030433Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7030469Z graph_break [] 2025-12-04T10:05:37.7030540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7030579Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7030634Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7030728Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7031066Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7031122Z graph_break [] 2025-12-04T10:05:37.7031195Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7031236Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7031289Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7031382Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7031721Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7031758Z graph_break [] 2025-12-04T10:05:37.7031828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7031868Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7031921Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7032016Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7032355Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7032391Z graph_break [] 2025-12-04T10:05:37.7032462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7032504Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7032557Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7032653Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7032992Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7033043Z graph_break [] 2025-12-04T10:05:37.7033113Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7033153Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7033205Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7033310Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7033647Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7033684Z graph_break [] 2025-12-04T10:05:37.7033755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7033795Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7033849Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7033944Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7034284Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7034319Z graph_break [] 2025-12-04T10:05:37.7034392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7034444Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7034497Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7034601Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7034940Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7034976Z graph_break [] 2025-12-04T10:05:37.7035048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7035087Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7035141Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7035235Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7035575Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7035610Z graph_break [] 2025-12-04T10:05:37.7035682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7035721Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7035774Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7035868Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7036245Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7036281Z graph_break [] 2025-12-04T10:05:37.7036353Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.7036392Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7036447Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7036541Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7036895Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7036930Z graph_break [] 2025-12-04T10:05:37.7037016Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7037055Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7037110Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7037203Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7037543Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7037579Z graph_break [] 2025-12-04T10:05:37.7037649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7037689Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7037742Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7037837Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7038176Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7038225Z graph_break [] 2025-12-04T10:05:37.7038308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7038350Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7038403Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7038497Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7038841Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7038877Z graph_break [] 2025-12-04T10:05:37.7038949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7038989Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7039042Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7039137Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7039474Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7039513Z graph_break [] 2025-12-04T10:05:37.7039583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7039625Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7039677Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7039771Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.7040110Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7040160Z graph_break [] 2025-12-04T10:05:37.7040232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7040271Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7040324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7040417Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7040769Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7040805Z graph_break [] 2025-12-04T10:05:37.7040878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7040918Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7040971Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7041065Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7041405Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7041440Z graph_break [] 2025-12-04T10:05:37.7041512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7041567Z frames 
[('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7041621Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7041715Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7042064Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.7042100Z graph_break []
2025-12-04T10:05:37.7042189Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.7042233Z Traceback (most recent call last):
2025-12-04T10:05:37.7042387Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.7042457Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.7042595Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.7042655Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.7042812Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.7042883Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.7042931Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.7042933Z
2025-12-04T10:05:37.7042972Z Expected 0 but got 1.
2025-12-04T10:05:37.7043009Z Absolute difference: 1
2025-12-04T10:05:37.7043049Z Relative difference: inf
2025-12-04T10:05:37.7043051Z
2025-12-04T10:05:37.7043122Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.7043278Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.7043281Z
2025-12-04T10:05:37.7043367Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.7043440Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.7043480Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7043534Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7043855Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.7043952Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7043987Z graph_break []
2025-12-04T10:05:37.7044070Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.7044110Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7044165Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7044261Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7044571Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.7044607Z graph_break []
2025-12-04T10:05:37.7044680Z ----------------------------- Captured
stdout call ----------------------------- 2025-12-04T10:05:37.7044718Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7044772Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7044868Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7045176Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7045223Z graph_break [] 2025-12-04T10:05:37.7045309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7045349Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7045403Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7045498Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7045807Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7045843Z graph_break [] 2025-12-04T10:05:37.7045913Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7045994Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7046049Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7046145Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7046451Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7046488Z graph_break [] 2025-12-04T10:05:37.7046558Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7046599Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7046652Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7046747Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7047091Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7047129Z graph_break [] 2025-12-04T10:05:37.7047199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7047258Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7047310Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7047405Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7047758Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7047795Z graph_break [] 2025-12-04T10:05:37.7047867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7047907Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7047959Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7048055Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7048397Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7048431Z graph_break [] 2025-12-04T10:05:37.7048505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7048544Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7048598Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7048705Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7049057Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7049093Z graph_break [] 2025-12-04T10:05:37.7049165Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7049204Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7049257Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7049352Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7049691Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7049726Z graph_break [] 2025-12-04T10:05:37.7049798Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7049839Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.7049893Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7049989Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7050328Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7050363Z graph_break [] 2025-12-04T10:05:37.7050435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7050475Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7050529Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7050623Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7050961Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7051007Z graph_break [] 2025-12-04T10:05:37.7051078Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7051120Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7051184Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7051280Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7051619Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.7051655Z graph_break [] 2025-12-04T10:05:37.7051726Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7051768Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7051820Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7051915Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7052253Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7052300Z graph_break [] 2025-12-04T10:05:37.7052376Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7052417Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7052480Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7052575Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7052913Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7052949Z graph_break [] 2025-12-04T10:05:37.7053021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7053062Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7053115Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7053211Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7053552Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7053589Z graph_break [] 2025-12-04T10:05:37.7053661Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7053699Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7053753Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7053848Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7054187Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7054222Z graph_break [] 2025-12-04T10:05:37.7054295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7054334Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7054399Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7054493Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7054846Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7054882Z graph_break [] 2025-12-04T10:05:37.7054953Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7054993Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7055046Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.7055140Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7055482Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7055518Z graph_break [] 2025-12-04T10:05:37.7055591Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7055630Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7055685Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7055779Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7056187Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7056224Z graph_break [] 2025-12-04T10:05:37.7056295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7056334Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7056386Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7056481Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7056819Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7056855Z graph_break [] 
2025-12-04T10:05:37.7056926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7056967Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7057020Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7057115Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7057456Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7057493Z graph_break [] 2025-12-04T10:05:37.7057564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7057603Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7057656Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7057751Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7058090Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7058146Z graph_break [] 2025-12-04T10:05:37.7058217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7058256Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7058324Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7058419Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7058757Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7058794Z graph_break [] 2025-12-04T10:05:37.7058865Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7058906Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7058959Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7059054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7059395Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7059443Z graph_break [] 2025-12-04T10:05:37.7059515Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7059555Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7059608Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7059713Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7060053Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7060088Z graph_break [] 2025-12-04T10:05:37.7060161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7060200Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7060253Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.7060348Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7060687Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7060723Z graph_break [] 2025-12-04T10:05:37.7060796Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7060835Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7060889Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7060983Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7061321Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7061357Z graph_break [] 2025-12-04T10:05:37.7061430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7061469Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7061538Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7061633Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7061985Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7062022Z graph_break [] 2025-12-04T10:05:37.7062093Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7062135Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7062188Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7062283Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7062622Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7062659Z graph_break [] 2025-12-04T10:05:37.7062730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7062770Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7062823Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7062917Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7063282Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7063319Z graph_break [] 2025-12-04T10:05:37.7063391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7063430Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7063483Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7063579Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7063920Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7063956Z graph_break [] 2025-12-04T10:05:37.7064027Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7064068Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7064121Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7064215Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7064557Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7064592Z graph_break [] 2025-12-04T10:05:37.7064664Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7064704Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7064757Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7064852Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7065192Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7065238Z graph_break [] 2025-12-04T10:05:37.7065309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7065348Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7065402Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7065507Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7065850Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7065888Z graph_break [] 2025-12-04T10:05:37.7066007Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7066047Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7066101Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7066195Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7066530Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7066565Z graph_break [] 2025-12-04T10:05:37.7066637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7066691Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7066746Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7066853Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7067194Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7067232Z graph_break [] 2025-12-04T10:05:37.7067303Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.7067343Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7067396Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7067492Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7067837Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7067873Z graph_break [] 2025-12-04T10:05:37.7067943Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7067983Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7068035Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7068131Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7068470Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7068506Z graph_break [] 2025-12-04T10:05:37.7068577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7068619Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7068672Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7068796Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7069150Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7069187Z graph_break [] 2025-12-04T10:05:37.7069258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7069298Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7069352Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7069449Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7069791Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7069828Z graph_break [] 2025-12-04T10:05:37.7069899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7069939Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7069993Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7070086Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7070425Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7070483Z graph_break [] 2025-12-04T10:05:37.7070555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7070595Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7070649Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7070743Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.7071082Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7071118Z graph_break [] 2025-12-04T10:05:37.7071189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7071227Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7071282Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7071375Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7071718Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7071752Z graph_break [] 2025-12-04T10:05:37.7071826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7071865Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7071919Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7072014Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7072353Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7072400Z graph_break [] 2025-12-04T10:05:37.7072472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7072513Z frames 
[('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7072565Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7072672Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7073009Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.7073046Z graph_break []
2025-12-04T10:05:37.7073117Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.7073158Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7073211Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7073305Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7073646Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.7073681Z graph_break []
2025-12-04T10:05:37.7073769Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.7073825Z Traceback (most recent call last):
2025-12-04T10:05:37.7073977Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.7074057Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.7074197Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.7074258Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.7074418Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.7074489Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.7074537Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.7074539Z
2025-12-04T10:05:37.7074578Z Expected 0 but got 1.
2025-12-04T10:05:37.7074616Z Absolute difference: 1
2025-12-04T10:05:37.7074657Z Relative difference: inf
2025-12-04T10:05:37.7074659Z
2025-12-04T10:05:37.7074730Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.7074888Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.7074890Z
2025-12-04T10:05:37.7074977Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.7075050Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.7075091Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7075144Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7075456Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)]
2025-12-04T10:05:37.7075552Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7075588Z graph_break []
2025-12-04T10:05:37.7075659Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:05:37.7075702Z frames [('total', 1), ('ok', 1)]
2025-12-04T10:05:37.7075756Z stats [('calls_captured', 2), ('unique_graphs', 1)]
2025-12-04T10:05:37.7075863Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]
2025-12-04T10:05:37.7076220Z inductor
[('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7076275Z graph_break [] 2025-12-04T10:05:37.7076347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7076387Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7076440Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7076534Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7076844Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7076881Z graph_break [] 2025-12-04T10:05:37.7076951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7076991Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7077043Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7077139Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7077446Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7077496Z graph_break [] 2025-12-04T10:05:37.7077582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7077622Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7077676Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7077770Z 
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7078077Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7078113Z graph_break [] 2025-12-04T10:05:37.7078184Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7078225Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7078278Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7078372Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7078715Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7078751Z graph_break [] 2025-12-04T10:05:37.7078823Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7078862Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7078916Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7079009Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7079351Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7079386Z graph_break [] 2025-12-04T10:05:37.7079458Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.7079511Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7079565Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7079660Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7080013Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7080049Z graph_break [] 2025-12-04T10:05:37.7080121Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7080160Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7080214Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7080308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7080651Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7080688Z graph_break [] 2025-12-04T10:05:37.7080759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7080800Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7080852Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7080965Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7081322Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7081359Z graph_break [] 2025-12-04T10:05:37.7081430Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7081470Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7081523Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7081617Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7081955Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7081992Z graph_break [] 2025-12-04T10:05:37.7082064Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7082103Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7082156Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7082250Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7082596Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7082631Z graph_break [] 2025-12-04T10:05:37.7082703Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7082744Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7082797Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7082892Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.7083231Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7083277Z graph_break [] 2025-12-04T10:05:37.7083347Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7083398Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7083451Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7083546Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7083888Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7083924Z graph_break [] 2025-12-04T10:05:37.7083995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7084034Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7084089Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7084182Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7084523Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7084569Z graph_break [] 2025-12-04T10:05:37.7084642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7084691Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7084744Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7084840Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7085179Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7085216Z graph_break [] 2025-12-04T10:05:37.7085289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7085328Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7085384Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7085477Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7085821Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7085856Z graph_break [] 2025-12-04T10:05:37.7085964Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7086003Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7086058Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7086152Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7086498Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7086535Z graph_break [] 2025-12-04T10:05:37.7086606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7086667Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7086721Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7086815Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7087168Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7087207Z graph_break [] 2025-12-04T10:05:37.7087278Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7087320Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7087372Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7087467Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7087806Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7087844Z graph_break [] 2025-12-04T10:05:37.7087917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7087957Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7088010Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7088122Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7088474Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7088511Z graph_break [] 2025-12-04T10:05:37.7088582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7088626Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7088678Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7088800Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7089142Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7089178Z graph_break [] 2025-12-04T10:05:37.7089251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7089291Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7089345Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7089438Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7089779Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7089815Z graph_break [] 2025-12-04T10:05:37.7089888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7089929Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.7089984Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7090079Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7090418Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7090464Z graph_break [] 2025-12-04T10:05:37.7090536Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7090575Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7090639Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7090733Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7091076Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7091111Z graph_break [] 2025-12-04T10:05:37.7091185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7091224Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7091277Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7091370Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7094586Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.7094647Z graph_break [] 2025-12-04T10:05:37.7094722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7094763Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7094829Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7094925Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7095264Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7095299Z graph_break [] 2025-12-04T10:05:37.7095372Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7095413Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7095465Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7095561Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7095899Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7095970Z graph_break [] 2025-12-04T10:05:37.7096042Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7096082Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7096134Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7096231Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7096569Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7096607Z graph_break [] 2025-12-04T10:05:37.7096681Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7096722Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7096791Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7096886Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7097239Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7097275Z graph_break [] 2025-12-04T10:05:37.7097346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7097388Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7097441Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7097538Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7097875Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7097911Z graph_break [] 2025-12-04T10:05:37.7097982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7098022Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7098076Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.7098169Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7098541Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7098578Z graph_break [] 2025-12-04T10:05:37.7098649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7098689Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7098743Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7098837Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7099175Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7099211Z graph_break [] 2025-12-04T10:05:37.7099282Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7099323Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7099376Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7099470Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7099809Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7099844Z graph_break [] 
2025-12-04T10:05:37.7099916Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7099955Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7100009Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7100103Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7100449Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7100495Z graph_break [] 2025-12-04T10:05:37.7100566Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7100606Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7100669Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7100765Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7101104Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7101141Z graph_break [] 2025-12-04T10:05:37.7101212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7101253Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7101305Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7101399Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7101737Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7101792Z graph_break [] 2025-12-04T10:05:37.7101862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7101903Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7101958Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7102065Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7102411Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7102447Z graph_break [] 2025-12-04T10:05:37.7102518Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7102558Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7102611Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7102707Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7103046Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7103082Z graph_break [] 2025-12-04T10:05:37.7103153Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7103192Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7103246Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.7103340Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7103677Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7103713Z graph_break [] 2025-12-04T10:05:37.7103784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7103823Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7103890Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7103984Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7104332Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7104367Z graph_break [] 2025-12-04T10:05:37.7104438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7104479Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7104533Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7104628Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7104964Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7105000Z graph_break [] 2025-12-04T10:05:37.7105072Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7105111Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7105165Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7105259Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7105619Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7105654Z graph_break [] 2025-12-04T10:05:37.7105726Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7105766Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7105818Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7105912Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7106295Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7106332Z graph_break [] 2025-12-04T10:05:37.7106403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7106444Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7106497Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7106591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7106932Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7106968Z graph_break [] 2025-12-04T10:05:37.7107039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7107079Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7107132Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7107226Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7107564Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7107618Z graph_break [] 2025-12-04T10:05:37.7107689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7107729Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7107781Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7107889Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7108229Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7108266Z graph_break [] 2025-12-04T10:05:37.7108339Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7108379Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7108432Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7108527Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7108866Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7108901Z graph_break [] 2025-12-04T10:05:37.7109006Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________ 2025-12-04T10:05:37.7109051Z Traceback (most recent call last): 2025-12-04T10:05:37.7109221Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction 2025-12-04T10:05:37.7109293Z self.assertEqual(metrics.codegen_mix_order_reduction, 0) 2025-12-04T10:05:37.7109434Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual 2025-12-04T10:05:37.7109494Z return super().assertEqual(x, y, *args, **kwargs) 2025-12-04T10:05:37.7109653Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:05:37.7109724Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:05:37.7109773Z AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7109775Z 2025-12-04T10:05:37.7109813Z Expected 0 but got 1. 
2025-12-04T10:05:37.7109853Z Absolute difference: 1 2025-12-04T10:05:37.7109893Z Relative difference: inf 2025-12-04T10:05:37.7109895Z 2025-12-04T10:05:37.7109967Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7110124Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7110127Z 2025-12-04T10:05:37.7110215Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7110287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7110328Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7110381Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7110693Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7110792Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7110827Z graph_break [] 2025-12-04T10:05:37.7110900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7110941Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7111004Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7111099Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7111423Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7111459Z graph_break [] 2025-12-04T10:05:37.7111530Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:05:37.7111571Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7111624Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7111717Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7112028Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7112063Z graph_break [] 2025-12-04T10:05:37.7112135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7112174Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7112227Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7112321Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7112629Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7112675Z graph_break [] 2025-12-04T10:05:37.7112757Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7112798Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7112851Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7112945Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7113253Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7113288Z graph_break [] 2025-12-04T10:05:37.7113359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7113399Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7113452Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7113547Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7113888Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7113925Z graph_break [] 2025-12-04T10:05:37.7113995Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7114036Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7114089Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7114183Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7114527Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7114575Z graph_break [] 2025-12-04T10:05:37.7114645Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7114685Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7114737Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7114832Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7115184Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7115221Z graph_break [] 2025-12-04T10:05:37.7115291Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7115332Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7115385Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7115481Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7115818Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7115855Z graph_break [] 2025-12-04T10:05:37.7115978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7116018Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7116088Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7116184Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7116534Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7116572Z graph_break [] 2025-12-04T10:05:37.7116643Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7116682Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.7116735Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7116830Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7117172Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7117209Z graph_break [] 2025-12-04T10:05:37.7117281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7117321Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7117374Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7117467Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7117807Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7117842Z graph_break [] 2025-12-04T10:05:37.7117915Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7117954Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7118007Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7118102Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7118454Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.7118490Z graph_break [] 2025-12-04T10:05:37.7118575Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7118615Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7118669Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7118763Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7119105Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7119140Z graph_break [] 2025-12-04T10:05:37.7119212Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7119251Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7119305Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7119400Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7119737Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7119785Z graph_break [] 2025-12-04T10:05:37.7119866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7119906Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7119958Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7120054Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7120391Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7120427Z graph_break [] 2025-12-04T10:05:37.7120497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7120538Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7120591Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7120684Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7121023Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7121059Z graph_break [] 2025-12-04T10:05:37.7121131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7121171Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7121225Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7121320Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7121660Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7121696Z graph_break [] 2025-12-04T10:05:37.7121766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7121817Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7121869Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.7121964Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7122314Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7122350Z graph_break [] 2025-12-04T10:05:37.7122421Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7122461Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7122515Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7122608Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7122948Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7122983Z graph_break [] 2025-12-04T10:05:37.7123055Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7123094Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7123147Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7123251Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7123600Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7123636Z graph_break [] 
2025-12-04T10:05:37.7123707Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7123746Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7123802Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7123897Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7124235Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7124271Z graph_break [] 2025-12-04T10:05:37.7124344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7124383Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7124438Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7124532Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7124875Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7124911Z graph_break [] 2025-12-04T10:05:37.7124982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7125023Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7125075Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7125172Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7125510Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7125563Z graph_break [] 2025-12-04T10:05:37.7125634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7125684Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7125737Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7125831Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7126214Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7126251Z graph_break [] 2025-12-04T10:05:37.7126321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7126361Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7126413Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7126507Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7126846Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7126903Z graph_break [] 2025-12-04T10:05:37.7126974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7127028Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7127080Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.7127177Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7127517Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7127553Z graph_break [] 2025-12-04T10:05:37.7127624Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7127663Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7127717Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7127810Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7128150Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7128185Z graph_break [] 2025-12-04T10:05:37.7128256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7128295Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7128349Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7128443Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7128781Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7128817Z graph_break [] 2025-12-04T10:05:37.7128889Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7128943Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7128996Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7129091Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7129443Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7129479Z graph_break [] 2025-12-04T10:05:37.7129551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7129589Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7129642Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7129737Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7130075Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7130111Z graph_break [] 2025-12-04T10:05:37.7130182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7130223Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7130276Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7130381Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7130730Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7130766Z graph_break [] 2025-12-04T10:05:37.7130837Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7130877Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7130930Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7131026Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7131370Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7131407Z graph_break [] 2025-12-04T10:05:37.7131479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7131519Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7131572Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7131666Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7132004Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7132040Z graph_break [] 2025-12-04T10:05:37.7132111Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7132153Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7132205Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7132300Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7132637Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7132685Z graph_break [] 2025-12-04T10:05:37.7132756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7132795Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7132859Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7132953Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7133298Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7133333Z graph_break [] 2025-12-04T10:05:37.7133406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7133445Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7133497Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7133591Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7133931Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7133978Z graph_break [] 2025-12-04T10:05:37.7134049Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.7134088Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7134151Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7134245Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7134584Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7134619Z graph_break [] 2025-12-04T10:05:37.7134691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7134730Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7134783Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7134880Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7135225Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7135261Z graph_break [] 2025-12-04T10:05:37.7135331Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7135371Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7135423Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7135519Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7135856Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7135893Z graph_break [] 2025-12-04T10:05:37.7135997Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7136053Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7136106Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7136200Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7136550Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7136587Z graph_break [] 2025-12-04T10:05:37.7136657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7136699Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7136751Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7136846Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7137184Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7137220Z graph_break [] 2025-12-04T10:05:37.7137292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7137333Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7137386Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7137481Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.7137850Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7137888Z graph_break [] 2025-12-04T10:05:37.7137959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7137999Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7138052Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7138146Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7138485Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7138521Z graph_break [] 2025-12-04T10:05:37.7138593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7138632Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7138685Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7138779Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7139118Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7139153Z graph_break [] 2025-12-04T10:05:37.7139224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7139263Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7139317Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7139410Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7139750Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7139796Z graph_break [] 2025-12-04T10:05:37.7139867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7139906Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7139969Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7140063Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7140403Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7140439Z graph_break [] 2025-12-04T10:05:37.7140511Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7140552Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7140605Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7140698Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7141037Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7141087Z graph_break [] 2025-12-04T10:05:37.7141157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7141197Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7141249Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7141357Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7141698Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7141733Z graph_break []
2025-12-04T10:05:37.7141821Z ______________ MixOrderReductionTest.test_3layer_split_reduction _______________
2025-12-04T10:05:37.7141866Z Traceback (most recent call last):
2025-12-04T10:05:37.7142019Z File "/var/lib/jenkins/pytorch/test/inductor/test_mix_order_reduction.py", line 203, in test_3layer_split_reduction
2025-12-04T10:05:37.7142090Z self.assertEqual(metrics.codegen_mix_order_reduction, 0)
2025-12-04T10:05:37.7142231Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/test_case.py", line 113, in assertEqual
2025-12-04T10:05:37.7142291Z return super().assertEqual(x, y, *args, **kwargs)
2025-12-04T10:05:37.7142449Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual
2025-12-04T10:05:37.7142519Z raise error_metas.pop()[0].to_error( # type: ignore[index]
2025-12-04T10:05:37.7142567Z AssertionError: Scalars are not equal!
2025-12-04T10:05:37.7142569Z
2025-12-04T10:05:37.7142607Z Expected 0 but got 1.
2025-12-04T10:05:37.7142645Z Absolute difference: 1
2025-12-04T10:05:37.7142685Z Relative difference: inf
2025-12-04T10:05:37.7142687Z
2025-12-04T10:05:37.7142758Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:37.7142916Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction
2025-12-04T10:05:37.7142919Z
2025-12-04T10:05:37.7143004Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:37.7143077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7143131Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7143185Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7143496Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7143603Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7143639Z graph_break [] 2025-12-04T10:05:37.7143710Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7143753Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7143806Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7143900Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7144212Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7144248Z graph_break [] 2025-12-04T10:05:37.7144319Z ----------------------------- Captured
stdout call ----------------------------- 2025-12-04T10:05:37.7144359Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7144412Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7144507Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7144840Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7144877Z graph_break [] 2025-12-04T10:05:37.7144947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7144988Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7145040Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7145133Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7145440Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7145477Z graph_break [] 2025-12-04T10:05:37.7145547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7145587Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7145640Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7145734Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7146088Z inductor [('triton_bundler_save_kernel', 16), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_miss', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7146124Z graph_break [] 2025-12-04T10:05:37.7146197Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7146237Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7146289Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7146386Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7146726Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7146781Z graph_break [] 2025-12-04T10:05:37.7146852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7146890Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7146943Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7147051Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7147390Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7147426Z graph_break [] 2025-12-04T10:05:37.7147499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7147538Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7147590Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7147685Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7148027Z inductor 
[('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7148062Z graph_break [] 2025-12-04T10:05:37.7148133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7148188Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7148241Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7148335Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7148697Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7148733Z graph_break [] 2025-12-04T10:05:37.7148804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7148843Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7148897Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7148990Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7149332Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7149367Z graph_break [] 2025-12-04T10:05:37.7149437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7149478Z frames [('total', 1), ('ok', 1)] 
2025-12-04T10:05:37.7149530Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7149626Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7149966Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7150002Z graph_break [] 2025-12-04T10:05:37.7150073Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7150113Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7150166Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7150260Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7150616Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7150652Z graph_break [] 2025-12-04T10:05:37.7150732Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7150773Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7150825Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7150921Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7151258Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 
1)] 2025-12-04T10:05:37.7151294Z graph_break [] 2025-12-04T10:05:37.7151365Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7151404Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7151457Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7151551Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7151890Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7151944Z graph_break [] 2025-12-04T10:05:37.7152026Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7152066Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7152120Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7152213Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7152556Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7152591Z graph_break [] 2025-12-04T10:05:37.7152662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7152702Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7152755Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7152849Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7153189Z inductor [('triton_bundler_save_kernel', 16), 
('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7153224Z graph_break [] 2025-12-04T10:05:37.7153296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7153335Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7153389Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7153482Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7153821Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7153866Z graph_break [] 2025-12-04T10:05:37.7153938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7153977Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7154030Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7154124Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7154472Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7154509Z graph_break [] 2025-12-04T10:05:37.7154580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7154622Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7154675Z stats [('calls_captured', 2), 
('unique_graphs', 1)] 2025-12-04T10:05:37.7154771Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7155108Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7155146Z graph_break [] 2025-12-04T10:05:37.7155216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7155256Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7155319Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7155414Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7155760Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7155797Z graph_break [] 2025-12-04T10:05:37.7155867Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7155907Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7156009Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7156104Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7156443Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7156480Z graph_break [] 
2025-12-04T10:05:37.7156552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7156592Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7156645Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7156741Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7157080Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7157116Z graph_break [] 2025-12-04T10:05:37.7157189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7157228Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7157282Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7157377Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7157732Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7157767Z graph_break [] 2025-12-04T10:05:37.7157853Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7157893Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7157948Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7158044Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7158383Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), 
('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7158419Z graph_break [] 2025-12-04T10:05:37.7158491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7158530Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7158584Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7158681Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7159022Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7159071Z graph_break [] 2025-12-04T10:05:37.7159155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7159196Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7159250Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7159344Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7159683Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7159718Z graph_break [] 2025-12-04T10:05:37.7159789Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7159829Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7159883Z stats [('calls_captured', 2), ('unique_graphs', 1)] 
2025-12-04T10:05:37.7159978Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7160318Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7160354Z graph_break [] 2025-12-04T10:05:37.7160425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7160465Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7160518Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7160613Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7160956Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7160991Z graph_break [] 2025-12-04T10:05:37.7161062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7161116Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7161168Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7161263Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7161611Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7161648Z graph_break [] 2025-12-04T10:05:37.7161719Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7161759Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7161813Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7161908Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7162246Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7162282Z graph_break [] 2025-12-04T10:05:37.7162353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7162393Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7162445Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7162551Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7162903Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7162939Z graph_break [] 2025-12-04T10:05:37.7163011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7163050Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7163103Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7163198Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7163536Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7163571Z graph_break [] 2025-12-04T10:05:37.7163643Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7163681Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7163735Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7163829Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7164167Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7164202Z graph_break [] 2025-12-04T10:05:37.7164273Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7164313Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7164366Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7164461Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7164799Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7164848Z graph_break [] 2025-12-04T10:05:37.7164920Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7164976Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7165031Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7165124Z aot_autograd [('total', 1), 
('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7165467Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7165503Z graph_break [] 2025-12-04T10:05:37.7165573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7165614Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7165666Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7165761Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7166136Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7166190Z graph_break [] 2025-12-04T10:05:37.7166261Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7166314Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7166367Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7166462Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7166798Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7166834Z graph_break [] 2025-12-04T10:05:37.7166905Z ----------------------------- Captured stdout call 
----------------------------- 2025-12-04T10:05:37.7166945Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7166998Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7167092Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7167431Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7167467Z graph_break [] 2025-12-04T10:05:37.7167539Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7167579Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7167632Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7167728Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7168065Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7168101Z graph_break [] 2025-12-04T10:05:37.7168173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7168225Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7168279Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7168371Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7168722Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), 
('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7168758Z graph_break [] 2025-12-04T10:05:37.7168830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7168870Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7168924Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7169019Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7169361Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7169396Z graph_break [] 2025-12-04T10:05:37.7169469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7169507Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7169560Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7169666Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7170014Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7170050Z graph_break [] 2025-12-04T10:05:37.7170122Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7170161Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7170214Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7170308Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), 
('ok', 1)] 2025-12-04T10:05:37.7170647Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7170684Z graph_break [] 2025-12-04T10:05:37.7170755Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7170795Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7170848Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7170943Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7171283Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7171318Z graph_break [] 2025-12-04T10:05:37.7171390Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7171431Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7171483Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7171577Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7171915Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7171963Z graph_break [] 2025-12-04T10:05:37.7172034Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7172074Z frames 
[('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7172136Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7172231Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7172571Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7172607Z graph_break [] 2025-12-04T10:05:37.7172680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7172720Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7172772Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7172866Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7173203Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7173252Z graph_break [] 2025-12-04T10:05:37.7173325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7173364Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7173427Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7173521Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7173859Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), 
('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7173895Z graph_break [] 2025-12-04T10:05:37.7173968Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7174006Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7174059Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7174153Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7174491Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7174529Z graph_break [] 2025-12-04T10:05:37.7174600Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:05:37.7174638Z frames [('total', 1), ('ok', 1)] 2025-12-04T10:05:37.7174691Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T10:05:37.7174786Z aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)] 2025-12-04T10:05:37.7175121Z inductor [('triton_bundler_save_kernel', 16), ('async_compile_cache_miss', 2), ('benchmarking.InductorBenchmarker.benchmark', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_miss', 1), ('async_compile_cache_hit', 1), ('triton_bundler_save_static_autotuner', 1)] 2025-12-04T10:05:37.7175157Z graph_break [] 2025-12-04T10:05:37.7175394Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_mix_order_reduction/inductor.test_mix_order_reduction-181fc8d92d9e2229.xml - 2025-12-04T10:05:37.7175468Z =========================== short test summary info ============================ 2025-12-04T10:05:37.7175677Z FAILED [1.0934s] 
inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7175679Z 2025-12-04T10:05:37.7175718Z Expected 0 but got 1. 2025-12-04T10:05:37.7175768Z Absolute difference: 1 2025-12-04T10:05:37.7175808Z Relative difference: inf 2025-12-04T10:05:37.7175811Z 2025-12-04T10:05:37.7175885Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7176094Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7176096Z 2025-12-04T10:05:37.7176184Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7176364Z FAILED [0.2084s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7176367Z 2025-12-04T10:05:37.7176404Z Expected 0 but got 1. 2025-12-04T10:05:37.7176443Z Absolute difference: 1 2025-12-04T10:05:37.7176482Z Relative difference: inf 2025-12-04T10:05:37.7176484Z 2025-12-04T10:05:37.7176555Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7176709Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7176711Z 2025-12-04T10:05:37.7176797Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7176992Z FAILED [0.2067s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7176995Z 2025-12-04T10:05:37.7177045Z Expected 0 but got 1. 
2025-12-04T10:05:37.7177084Z Absolute difference: 1 2025-12-04T10:05:37.7177124Z Relative difference: inf 2025-12-04T10:05:37.7177126Z 2025-12-04T10:05:37.7177196Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7177348Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7177351Z 2025-12-04T10:05:37.7177434Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7177614Z FAILED [0.2055s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7177617Z 2025-12-04T10:05:37.7177654Z Expected 0 but got 1. 2025-12-04T10:05:37.7177692Z Absolute difference: 1 2025-12-04T10:05:37.7177732Z Relative difference: inf 2025-12-04T10:05:37.7177734Z 2025-12-04T10:05:37.7177804Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7177956Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7177959Z 2025-12-04T10:05:37.7178042Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7178221Z FAILED [0.2093s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7178223Z 2025-12-04T10:05:37.7178261Z Expected 0 but got 1. 
2025-12-04T10:05:37.7178299Z Absolute difference: 1 2025-12-04T10:05:37.7178337Z Relative difference: inf 2025-12-04T10:05:37.7178339Z 2025-12-04T10:05:37.7178409Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7178561Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7178563Z 2025-12-04T10:05:37.7178647Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7178823Z FAILED [0.3329s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7178853Z 2025-12-04T10:05:37.7178891Z Expected 0 but got 1. 2025-12-04T10:05:37.7178928Z Absolute difference: 1 2025-12-04T10:05:37.7178968Z Relative difference: inf 2025-12-04T10:05:37.7178970Z 2025-12-04T10:05:37.7179058Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7179211Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7179213Z 2025-12-04T10:05:37.7179297Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7179473Z FAILED [0.3450s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7179476Z 2025-12-04T10:05:37.7179514Z Expected 0 but got 1. 
2025-12-04T10:05:37.7179552Z Absolute difference: 1 2025-12-04T10:05:37.7179591Z Relative difference: inf 2025-12-04T10:05:37.7179593Z 2025-12-04T10:05:37.7179662Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7179813Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7179815Z 2025-12-04T10:05:37.7179897Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7180073Z FAILED [0.3533s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7180088Z 2025-12-04T10:05:37.7180125Z Expected 0 but got 1. 2025-12-04T10:05:37.7180162Z Absolute difference: 1 2025-12-04T10:05:37.7180201Z Relative difference: inf 2025-12-04T10:05:37.7180203Z 2025-12-04T10:05:37.7180284Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7180437Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7180440Z 2025-12-04T10:05:37.7180525Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7180703Z FAILED [0.3602s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7180706Z 2025-12-04T10:05:37.7180744Z Expected 0 but got 1. 
2025-12-04T10:05:37.7180781Z Absolute difference: 1 2025-12-04T10:05:37.7180820Z Relative difference: inf 2025-12-04T10:05:37.7180823Z 2025-12-04T10:05:37.7180892Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7181043Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7181046Z 2025-12-04T10:05:37.7181128Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7181306Z FAILED [0.3681s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7181308Z 2025-12-04T10:05:37.7181345Z Expected 0 but got 1. 2025-12-04T10:05:37.7181382Z Absolute difference: 1 2025-12-04T10:05:37.7181422Z Relative difference: inf 2025-12-04T10:05:37.7181423Z 2025-12-04T10:05:37.7181493Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7181644Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7181647Z 2025-12-04T10:05:37.7181729Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7181907Z FAILED [0.3612s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7181909Z 2025-12-04T10:05:37.7181947Z Expected 0 but got 1. 
2025-12-04T10:05:37.7182000Z Absolute difference: 1 2025-12-04T10:05:37.7182038Z Relative difference: inf 2025-12-04T10:05:37.7182040Z 2025-12-04T10:05:37.7182109Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7182260Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7182262Z 2025-12-04T10:05:37.7182358Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7182534Z FAILED [0.3650s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7182538Z 2025-12-04T10:05:37.7182575Z Expected 0 but got 1. 2025-12-04T10:05:37.7182613Z Absolute difference: 1 2025-12-04T10:05:37.7182652Z Relative difference: inf 2025-12-04T10:05:37.7182654Z 2025-12-04T10:05:37.7182724Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7182879Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7182881Z 2025-12-04T10:05:37.7182964Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7183141Z FAILED [0.3894s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7183143Z 2025-12-04T10:05:37.7183180Z Expected 0 but got 1. 
2025-12-04T10:05:37.7183218Z Absolute difference: 1 2025-12-04T10:05:37.7183257Z Relative difference: inf 2025-12-04T10:05:37.7183272Z 2025-12-04T10:05:37.7183341Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7183504Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7183506Z 2025-12-04T10:05:37.7183589Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7183767Z FAILED [0.3626s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7183769Z 2025-12-04T10:05:37.7183806Z Expected 0 but got 1. 2025-12-04T10:05:37.7183843Z Absolute difference: 1 2025-12-04T10:05:37.7183882Z Relative difference: inf 2025-12-04T10:05:37.7183884Z 2025-12-04T10:05:37.7183955Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7184106Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7184109Z 2025-12-04T10:05:37.7184192Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7184370Z FAILED [0.3664s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7184372Z 2025-12-04T10:05:37.7184410Z Expected 0 but got 1. 
2025-12-04T10:05:37.7184451Z Absolute difference: 1 2025-12-04T10:05:37.7184490Z Relative difference: inf 2025-12-04T10:05:37.7184491Z 2025-12-04T10:05:37.7184561Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7184711Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7184714Z 2025-12-04T10:05:37.7184797Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7184975Z FAILED [0.6296s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7184978Z 2025-12-04T10:05:37.7185017Z Expected 0 but got 1. 2025-12-04T10:05:37.7185056Z Absolute difference: 1 2025-12-04T10:05:37.7185096Z Relative difference: inf 2025-12-04T10:05:37.7185097Z 2025-12-04T10:05:37.7185166Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7185331Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7185333Z 2025-12-04T10:05:37.7185414Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7185603Z FAILED [0.3363s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7185605Z 2025-12-04T10:05:37.7185642Z Expected 0 but got 1. 
2025-12-04T10:05:37.7185680Z Absolute difference: 1 2025-12-04T10:05:37.7185719Z Relative difference: inf 2025-12-04T10:05:37.7185721Z 2025-12-04T10:05:37.7185791Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7185984Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7185987Z 2025-12-04T10:05:37.7186069Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7186247Z FAILED [0.3886s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7186249Z 2025-12-04T10:05:37.7186285Z Expected 0 but got 1. 2025-12-04T10:05:37.7186322Z Absolute difference: 1 2025-12-04T10:05:37.7186360Z Relative difference: inf 2025-12-04T10:05:37.7186363Z 2025-12-04T10:05:37.7186432Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7186585Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7186602Z 2025-12-04T10:05:37.7186686Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7186875Z FAILED [0.3734s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7186879Z 2025-12-04T10:05:37.7186917Z Expected 0 but got 1. 
2025-12-04T10:05:37.7186954Z Absolute difference: 1 2025-12-04T10:05:37.7186992Z Relative difference: inf 2025-12-04T10:05:37.7186994Z 2025-12-04T10:05:37.7187063Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7187217Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7187219Z 2025-12-04T10:05:37.7187301Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7187480Z FAILED [0.3848s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7187483Z 2025-12-04T10:05:37.7187521Z Expected 0 but got 1. 2025-12-04T10:05:37.7187559Z Absolute difference: 1 2025-12-04T10:05:37.7187598Z Relative difference: inf 2025-12-04T10:05:37.7187602Z 2025-12-04T10:05:37.7187672Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7187826Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7187828Z 2025-12-04T10:05:37.7187911Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7188089Z FAILED [0.3846s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7188091Z 2025-12-04T10:05:37.7188128Z Expected 0 but got 1. 
2025-12-04T10:05:37.7188168Z Absolute difference: 1 2025-12-04T10:05:37.7188206Z Relative difference: inf 2025-12-04T10:05:37.7188208Z 2025-12-04T10:05:37.7188278Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7188429Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7188446Z 2025-12-04T10:05:37.7188529Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7188705Z FAILED [0.3746s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7188707Z 2025-12-04T10:05:37.7188745Z Expected 0 but got 1. 2025-12-04T10:05:37.7188782Z Absolute difference: 1 2025-12-04T10:05:37.7188835Z Relative difference: inf 2025-12-04T10:05:37.7188837Z 2025-12-04T10:05:37.7188907Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7189059Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7189063Z 2025-12-04T10:05:37.7189144Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7189323Z FAILED [0.3845s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7189326Z 2025-12-04T10:05:37.7189363Z Expected 0 but got 1. 
2025-12-04T10:05:37.7189400Z Absolute difference: 1 2025-12-04T10:05:37.7189440Z Relative difference: inf 2025-12-04T10:05:37.7189441Z 2025-12-04T10:05:37.7189510Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7189663Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7189665Z 2025-12-04T10:05:37.7189746Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7189935Z FAILED [0.3739s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7189937Z 2025-12-04T10:05:37.7189973Z Expected 0 but got 1. 2025-12-04T10:05:37.7190028Z Absolute difference: 1 2025-12-04T10:05:37.7190067Z Relative difference: inf 2025-12-04T10:05:37.7190070Z 2025-12-04T10:05:37.7190140Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7190292Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7190294Z 2025-12-04T10:05:37.7190377Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7190554Z FAILED [0.3478s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7190556Z 2025-12-04T10:05:37.7190594Z Expected 0 but got 1. 
2025-12-04T10:05:37.7190632Z Absolute difference: 1 2025-12-04T10:05:37.7190672Z Relative difference: inf 2025-12-04T10:05:37.7190674Z 2025-12-04T10:05:37.7190743Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7190895Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7190898Z 2025-12-04T10:05:37.7190981Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7191159Z FAILED [0.4126s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7191161Z 2025-12-04T10:05:37.7191198Z Expected 0 but got 1. 2025-12-04T10:05:37.7191236Z Absolute difference: 1 2025-12-04T10:05:37.7191276Z Relative difference: inf 2025-12-04T10:05:37.7191278Z 2025-12-04T10:05:37.7191347Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7191501Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7191503Z 2025-12-04T10:05:37.7191585Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7191763Z FAILED [0.4053s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7191776Z 2025-12-04T10:05:37.7191813Z Expected 0 but got 1. 
2025-12-04T10:05:37.7191850Z Absolute difference: 1 2025-12-04T10:05:37.7191889Z Relative difference: inf 2025-12-04T10:05:37.7191890Z 2025-12-04T10:05:37.7191960Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7192122Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7192124Z 2025-12-04T10:05:37.7192207Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7192385Z FAILED [0.3605s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7192387Z 2025-12-04T10:05:37.7192425Z Expected 0 but got 1. 2025-12-04T10:05:37.7192462Z Absolute difference: 1 2025-12-04T10:05:37.7192501Z Relative difference: inf 2025-12-04T10:05:37.7192504Z 2025-12-04T10:05:37.7192574Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7192725Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7192726Z 2025-12-04T10:05:37.7192809Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7192986Z FAILED [0.3353s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7192988Z 2025-12-04T10:05:37.7193036Z Expected 0 but got 1. 
2025-12-04T10:05:37.7193073Z Absolute difference: 1 2025-12-04T10:05:37.7193112Z Relative difference: inf 2025-12-04T10:05:37.7193114Z 2025-12-04T10:05:37.7193182Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7193350Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7193354Z 2025-12-04T10:05:37.7193435Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7193612Z FAILED [0.3361s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7193614Z 2025-12-04T10:05:37.7193651Z Expected 0 but got 1. 2025-12-04T10:05:37.7193690Z Absolute difference: 1 2025-12-04T10:05:37.7193729Z Relative difference: inf 2025-12-04T10:05:37.7193731Z 2025-12-04T10:05:37.7193801Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7193954Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7193956Z 2025-12-04T10:05:37.7194040Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7194216Z FAILED [0.3568s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7194220Z 2025-12-04T10:05:37.7194256Z Expected 0 but got 1. 
2025-12-04T10:05:37.7194293Z Absolute difference: 1 2025-12-04T10:05:37.7194333Z Relative difference: inf 2025-12-04T10:05:37.7194335Z 2025-12-04T10:05:37.7194405Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7194556Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7194558Z 2025-12-04T10:05:37.7194641Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7194817Z FAILED [0.3804s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7194819Z 2025-12-04T10:05:37.7194858Z Expected 0 but got 1. 2025-12-04T10:05:37.7194896Z Absolute difference: 1 2025-12-04T10:05:37.7194950Z Relative difference: inf 2025-12-04T10:05:37.7194952Z 2025-12-04T10:05:37.7195021Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7195173Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7195175Z 2025-12-04T10:05:37.7195268Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7195445Z FAILED [0.3980s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7195447Z 2025-12-04T10:05:37.7195484Z Expected 0 but got 1. 
2025-12-04T10:05:37.7195523Z Absolute difference: 1 2025-12-04T10:05:37.7195561Z Relative difference: inf 2025-12-04T10:05:37.7195563Z 2025-12-04T10:05:37.7195633Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7195786Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7195789Z 2025-12-04T10:05:37.7195871Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7196089Z FAILED [0.3489s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7196093Z 2025-12-04T10:05:37.7196130Z Expected 0 but got 1. 2025-12-04T10:05:37.7196168Z Absolute difference: 1 2025-12-04T10:05:37.7196206Z Relative difference: inf 2025-12-04T10:05:37.7196208Z 2025-12-04T10:05:37.7196298Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7196449Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7196451Z 2025-12-04T10:05:37.7196548Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7196724Z FAILED [0.3634s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7196730Z 2025-12-04T10:05:37.7196767Z Expected 0 but got 1. 
2025-12-04T10:05:37.7196803Z Absolute difference: 1 2025-12-04T10:05:37.7196842Z Relative difference: inf 2025-12-04T10:05:37.7196844Z 2025-12-04T10:05:37.7196914Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7197066Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7197069Z 2025-12-04T10:05:37.7197151Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7197330Z FAILED [0.3195s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7197332Z 2025-12-04T10:05:37.7197372Z Expected 0 but got 1. 2025-12-04T10:05:37.7197413Z Absolute difference: 1 2025-12-04T10:05:37.7197451Z Relative difference: inf 2025-12-04T10:05:37.7197453Z 2025-12-04T10:05:37.7197522Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7197673Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7197676Z 2025-12-04T10:05:37.7197759Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7197935Z FAILED [0.3517s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7197938Z 2025-12-04T10:05:37.7197975Z Expected 0 but got 1. 
2025-12-04T10:05:37.7198012Z Absolute difference: 1 2025-12-04T10:05:37.7198050Z Relative difference: inf 2025-12-04T10:05:37.7198052Z 2025-12-04T10:05:37.7198122Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7198275Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7198298Z 2025-12-04T10:05:37.7198383Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7198561Z FAILED [0.2083s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7198583Z 2025-12-04T10:05:37.7198621Z Expected 0 but got 1. 2025-12-04T10:05:37.7198658Z Absolute difference: 1 2025-12-04T10:05:37.7198697Z Relative difference: inf 2025-12-04T10:05:37.7198700Z 2025-12-04T10:05:37.7198769Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7198921Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7198924Z 2025-12-04T10:05:37.7199005Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7199185Z FAILED [0.2091s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7199187Z 2025-12-04T10:05:37.7199224Z Expected 0 but got 1. 
2025-12-04T10:05:37.7199265Z Absolute difference: 1 2025-12-04T10:05:37.7199303Z Relative difference: inf 2025-12-04T10:05:37.7199305Z 2025-12-04T10:05:37.7199376Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7199527Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7199545Z 2025-12-04T10:05:37.7199627Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7199815Z FAILED [0.2022s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7199817Z 2025-12-04T10:05:37.7199854Z Expected 0 but got 1. 2025-12-04T10:05:37.7199893Z Absolute difference: 1 2025-12-04T10:05:37.7199931Z Relative difference: inf 2025-12-04T10:05:37.7199933Z 2025-12-04T10:05:37.7200005Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7200156Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7200158Z 2025-12-04T10:05:37.7200240Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7200417Z FAILED [0.2046s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7200420Z 2025-12-04T10:05:37.7200457Z Expected 0 but got 1. 
2025-12-04T10:05:37.7200494Z Absolute difference: 1 2025-12-04T10:05:37.7200532Z Relative difference: inf 2025-12-04T10:05:37.7200535Z 2025-12-04T10:05:37.7200604Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7200758Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7200760Z 2025-12-04T10:05:37.7200841Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7201021Z FAILED [0.2048s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7201023Z 2025-12-04T10:05:37.7201060Z Expected 0 but got 1. 2025-12-04T10:05:37.7201097Z Absolute difference: 1 2025-12-04T10:05:37.7201135Z Relative difference: inf 2025-12-04T10:05:37.7201139Z 2025-12-04T10:05:37.7201208Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7201362Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7201364Z 2025-12-04T10:05:37.7201446Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7201636Z FAILED [0.2004s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7201638Z 2025-12-04T10:05:37.7201675Z Expected 0 but got 1. 
2025-12-04T10:05:37.7201712Z Absolute difference: 1 2025-12-04T10:05:37.7201751Z Relative difference: inf 2025-12-04T10:05:37.7201752Z 2025-12-04T10:05:37.7201834Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7201986Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7201989Z 2025-12-04T10:05:37.7202072Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7202249Z FAILED [0.2109s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7202250Z 2025-12-04T10:05:37.7202289Z Expected 0 but got 1. 2025-12-04T10:05:37.7202326Z Absolute difference: 1 2025-12-04T10:05:37.7202364Z Relative difference: inf 2025-12-04T10:05:37.7202366Z 2025-12-04T10:05:37.7202435Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7202588Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7202591Z 2025-12-04T10:05:37.7202673Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7202852Z FAILED [0.2071s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7202871Z 2025-12-04T10:05:37.7202909Z Expected 0 but got 1. 
2025-12-04T10:05:37.7202946Z Absolute difference: 1 2025-12-04T10:05:37.7202997Z Relative difference: inf 2025-12-04T10:05:37.7202999Z 2025-12-04T10:05:37.7203069Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7203222Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7203223Z 2025-12-04T10:05:37.7203305Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7203484Z FAILED [0.2015s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7203485Z 2025-12-04T10:05:37.7203522Z Expected 0 but got 1. 2025-12-04T10:05:37.7203560Z Absolute difference: 1 2025-12-04T10:05:37.7203600Z Relative difference: inf 2025-12-04T10:05:37.7203601Z 2025-12-04T10:05:37.7203671Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7203822Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7203824Z 2025-12-04T10:05:37.7203907Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7204083Z FAILED [0.2228s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7204084Z 2025-12-04T10:05:37.7204122Z Expected 0 but got 1. 
2025-12-04T10:05:37.7204159Z Absolute difference: 1 2025-12-04T10:05:37.7204198Z Relative difference: inf 2025-12-04T10:05:37.7204200Z 2025-12-04T10:05:37.7204269Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7204421Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7204424Z 2025-12-04T10:05:37.7204507Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7204684Z FAILED [0.2084s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7204698Z 2025-12-04T10:05:37.7204736Z Expected 0 but got 1. 2025-12-04T10:05:37.7204774Z Absolute difference: 1 2025-12-04T10:05:37.7204813Z Relative difference: inf 2025-12-04T10:05:37.7204814Z 2025-12-04T10:05:37.7204883Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7205046Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7205048Z 2025-12-04T10:05:37.7205130Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7205307Z FAILED [0.2044s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7205310Z 2025-12-04T10:05:37.7205346Z Expected 0 but got 1. 
2025-12-04T10:05:37.7205384Z Absolute difference: 1 2025-12-04T10:05:37.7205423Z Relative difference: inf 2025-12-04T10:05:37.7205425Z 2025-12-04T10:05:37.7205496Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7205647Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7205648Z 2025-12-04T10:05:37.7205730Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7205907Z FAILED [0.2017s] inductor/test_mix_order_reduction.py::MixOrderReductionTest::test_3layer_split_reduction - AssertionError: Scalars are not equal! 2025-12-04T10:05:37.7205910Z 2025-12-04T10:05:37.7205987Z Expected 0 but got 1. 2025-12-04T10:05:37.7206041Z Absolute difference: 1 2025-12-04T10:05:37.7206081Z Relative difference: inf 2025-12-04T10:05:37.7206083Z 2025-12-04T10:05:37.7206151Z To execute this test, run the following from the base repo dir: 2025-12-04T10:05:37.7206317Z PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_mix_order_reduction.py MixOrderReductionTest.test_3layer_split_reduction 2025-12-04T10:05:37.7206320Z 2025-12-04T10:05:37.7206403Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:05:37.7206459Z ============================= 50 failed in 17.02s ============================== 2025-12-04T10:05:37.7206461Z 2025-12-04T10:05:37.7206647Z FINISHED PRINTING LOG FILE of inductor/test_mix_order_reduction 1/1 (test/test-reports/inductor.test_mix_order_reduction_1.1_7f4938ab0968cfe4_.log) 2025-12-04T10:05:37.7206650Z 2025-12-04T10:05:37.7206771Z Finished inductor/test_mix_order_reduction 1/1 ... 
[2025-12-04 10:05:37.578360][5636758.084742412], took 0.39min 2025-12-04T10:05:37.7207008Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:37.7207125Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:37.7207221Z GITHUB_RUN_ID, GITHUB_RUN_ATTEMPT, or ARTIFACTS_FILE_SUFFIX not set, not uploading 2025-12-04T10:05:37.7207272Z Uploading artifacts took 0.00 seconds 2025-12-04T10:05:37.7207328Z inductor/test_mix_order_reduction 1/1 failed! 2025-12-04T10:05:37.7207423Z Running dynamo/test_cudagraphs 1/1 ... [2025-12-04 10:05:37.667842][5636758.174239445] 2025-12-04T10:05:37.7207472Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:37.7207831Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_cudagraphs.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:37.668181] 2025-12-04T10:05:39.9457722Z 2025-12-04T10:05:39.9458960Z dynamo/test_cudagraphs 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_cudagraphs_1.1_62cd5114dee3132b_.log 2025-12-04T10:05:39.9460010Z Running 0 items in this shard: 2025-12-04T10:05:39.9460271Z 2025-12-04T10:05:39.9460654Z Finished dynamo/test_cudagraphs 1/1 ... [2025-12-04 10:05:39.945346][5636760.451747636], took 0.04min 2025-12-04T10:05:39.9468459Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:40.0332064Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:40.0342097Z Running inductor/test_alignment 1/1 ... 
[2025-12-04 10:05:40.033703][5636760.540101162] 2025-12-04T10:05:40.0342762Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:40.0344296Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_alignment.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:40.034081] 2025-12-04T10:05:45.8934462Z 2025-12-04T10:05:45.8935881Z inductor/test_alignment 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_alignment_1.1_c40def6eb95a8d15_.log 2025-12-04T10:05:45.8937026Z Running 0 items in this shard: 2025-12-04T10:05:45.8937282Z 2025-12-04T10:05:45.8937664Z Finished inductor/test_alignment 1/1 ... [2025-12-04 10:05:45.893048][5636766.399449489], took 0.10min 2025-12-04T10:05:45.8943350Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:45.9813854Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:45.9820141Z Running inductor/test_padding 1/1 ... [2025-12-04 10:05:45.981705][5636766.488103831] 2025-12-04T10:05:45.9820753Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:45.9823519Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_padding.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:05:45.982069] 2025-12-04T10:05:51.9508912Z 2025-12-04T10:05:51.9510191Z inductor/test_padding 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_padding_1.1_edb9c87e50d70e77_.log 2025-12-04T10:05:51.9511189Z Running 0 items in this shard: 2025-12-04T10:05:51.9511441Z 2025-12-04T10:05:51.9511882Z Finished inductor/test_padding 1/1 ... [2025-12-04 10:05:51.950412][5636772.456812445], took 0.10min 2025-12-04T10:05:51.9520363Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:52.0395337Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:52.0401547Z Running dynamo/test_profiler 1/1 ... [2025-12-04 10:05:52.039970][5636772.546367322] 2025-12-04T10:05:52.0402172Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:52.0405796Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_profiler.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:52.040360] 2025-12-04T10:05:54.2568948Z 2025-12-04T10:05:54.2570210Z dynamo/test_profiler 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_profiler_1.1_512f30b5b1460238_.log 2025-12-04T10:05:54.2571195Z Running 0 items in this shard: 2025-12-04T10:05:54.2571464Z 2025-12-04T10:05:54.2571847Z Finished dynamo/test_profiler 1/1 ... 
[2025-12-04 10:05:54.255560][5636774.761960229], took 0.04min 2025-12-04T10:05:54.2573219Z Parsing testcases for test report: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-053cfb337602f31d.xml 2025-12-04T10:05:54.3457552Z Failed to parse and upload json test reports: Unable to locate credentials 2025-12-04T10:05:54.3463537Z Running dynamo/test_guard_serialization 1/1 ... [2025-12-04 10:05:54.346056][5636774.852455008] 2025-12-04T10:05:54.3464248Z SCRIBE_GRAPHQL_ACCESS_TOKEN is NOT set 2025-12-04T10:05:54.3468208Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_guard_serialization.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:54.346378] 2025-12-04T13:53:50.8126749Z ##[error]The operation was canceled. 2025-12-04T13:53:50.8190142Z ##[group]Run # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct 2025-12-04T13:53:50.8191107Z # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct 2025-12-04T13:53:50.8192315Z docker exec -t "0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test" 2025-12-04T13:53:50.8203118Z shell: /usr/bin/bash -e {0} 2025-12-04T13:53:50.8203470Z env: 2025-12-04T13:53:50.8203763Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:50.8204202Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:50.8204771Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:50.8205310Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:50.8206659Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add 
daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:50.8208039Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:50.8208407Z AWS_REGION: us-east-1 2025-12-04T13:53:50.8208854Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:50.8209335Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:50.8216077Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:50.8216611Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:50.8217182Z ##[endgroup] 2025-12-04T13:53:50.9054674Z ##[group]Run docker exec -t "0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7" sh -c "sudo chown -R 1001:1001 test" 2025-12-04T13:53:50.9055918Z docker exec -t "0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7" sh -c "sudo chown -R 1001:1001 test" 2025-12-04T13:53:50.9066110Z shell: /usr/bin/bash -e {0} 2025-12-04T13:53:50.9066492Z env: 2025-12-04T13:53:50.9066789Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:50.9067241Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:50.9067838Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:50.9068380Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:50.9069643Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:50.9070869Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:50.9071257Z AWS_REGION: us-east-1 2025-12-04T13:53:50.9071718Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:50.9072209Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:50.9079049Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:50.9079591Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:50.9080179Z ##[endgroup] 2025-12-04T13:53:50.9977879Z ##[group]Run cat test/**/*_toprint.log || true 
2025-12-04T13:53:50.9978406Z cat test/**/*_toprint.log || true 2025-12-04T13:53:50.9988372Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T13:53:50.9988845Z env: 2025-12-04T13:53:50.9989142Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:50.9989577Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:50.9990381Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:50.9990860Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:50.9992116Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:50.9993333Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:50.9993703Z AWS_REGION: us-east-1 2025-12-04T13:53:50.9994163Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:50.9994695Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:51.0001396Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:51.0001943Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:51.0002526Z ##[endgroup] 2025-12-04T13:53:51.0075204Z Test results will be stored in test-reports/python-pytest/dynamo.test_guard_serialization/dynamo.test_guard_serialization-42fc42d5fcd955ef.xml 2025-12-04T13:53:51.0076486Z ============================= test session starts ============================== 2025-12-04T13:53:51.0077248Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T13:53:51.0077874Z cachedir: .pytest_cache 2025-12-04T13:53:51.0078609Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T13:53:51.0079426Z rootdir: /var/lib/jenkins/pytorch 2025-12-04T13:53:51.0079976Z configfile: pytest.ini 2025-12-04T13:53:51.0080738Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, 
rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T13:53:51.0081765Z collecting ... collected 56 items 2025-12-04T13:53:51.0082230Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T13:53:51.0107878Z Running 50 items in this shard: test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, 
test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, 
test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module, test/dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module 2025-12-04T13:53:51.0132903Z 2025-12-04T13:53:51.0133454Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.2417s] [ 2%] 2025-12-04T13:53:51.0134644Z 
dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1276s] [ 2%] 2025-12-04T13:53:51.0135993Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1182s] [ 2%] 2025-12-04T13:53:51.0137159Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1405s] [ 2%] 2025-12-04T13:53:51.0138325Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1336s] [ 2%] 2025-12-04T13:53:51.0139487Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1309s] [ 2%] 2025-12-04T13:53:51.0140642Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1303s] [ 2%] 2025-12-04T13:53:51.0141814Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1466s] [ 2%] 2025-12-04T13:53:51.0143005Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1314s] [ 2%] 2025-12-04T13:53:51.0144172Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1323s] [ 2%] 2025-12-04T13:53:51.0145347Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1310s] [ 2%] 2025-12-04T13:53:51.0146563Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1293s] [ 2%] 2025-12-04T13:53:51.0147732Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1283s] [ 2%] 2025-12-04T13:53:51.0148897Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1275s] [ 2%] 2025-12-04T13:53:51.0150059Z 
dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1314s] [ 2%] 2025-12-04T13:53:51.0151160Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1416s] [ 2%] 2025-12-04T13:53:51.0152355Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1318s] [ 2%] 2025-12-04T13:53:51.0153506Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1311s] [ 2%] 2025-12-04T13:53:51.0154659Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1291s] [ 2%] 2025-12-04T13:53:51.0155818Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1311s] [ 2%] 2025-12-04T13:53:51.0157017Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1378s] [ 2%] 2025-12-04T13:53:51.0158164Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1289s] [ 2%] 2025-12-04T13:53:51.0159326Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1467s] [ 2%] 2025-12-04T13:53:51.0160485Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1231s] [ 2%] 2025-12-04T13:53:51.0161585Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1278s] [ 2%] 2025-12-04T13:53:51.0162744Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1421s] [ 2%] 2025-12-04T13:53:51.0164464Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module E1204 10:11:02.737000 299498 
site-packages/torch/testing/_internal/common_distributed.py:1487] Thread 0 terminated or timed out after 300 seconds 2025-12-04T13:53:51.0166068Z E1204 10:11:02.737000 299498 site-packages/torch/testing/_internal/common_distributed.py:1487] 2025-12-04T13:53:51.0166698Z FAILED [300.0075s] [ 2%] 2025-12-04T13:53:51.0167536Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1280s] [ 2%] 2025-12-04T13:53:51.0168704Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1220s] [ 2%] 2025-12-04T13:53:51.0169864Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1255s] [ 2%] 2025-12-04T13:53:51.0171011Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1308s] [ 2%] 2025-12-04T13:53:51.0172174Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1439s] [ 2%] 2025-12-04T13:53:51.0173338Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.3185s] [ 2%] 2025-12-04T13:53:51.0174496Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1391s] [ 2%] 2025-12-04T13:53:51.0175664Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1342s] [ 2%] 2025-12-04T13:53:51.0176921Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1389s] [ 2%] 2025-12-04T13:53:51.0178085Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1166s] [ 2%] 2025-12-04T13:53:51.0179250Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1308s] [ 2%] 
2025-12-04T13:53:51.0180417Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1340s] [ 2%] 2025-12-04T13:53:51.0181536Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1385s] [ 2%] 2025-12-04T13:53:51.0182707Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1284s] [ 2%] 2025-12-04T13:53:51.0183947Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1348s] [ 2%] 2025-12-04T13:53:51.0185113Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1388s] [ 2%] 2025-12-04T13:53:51.0186354Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1452s] [ 2%] 2025-12-04T13:53:51.0187530Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1389s] [ 2%] 2025-12-04T13:53:51.0188701Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1431s] [ 2%] 2025-12-04T13:53:51.0189866Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1318s] [ 2%] 2025-12-04T13:53:51.0191003Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1385s] [ 2%] 2025-12-04T13:53:51.0192158Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1354s] [ 2%] 2025-12-04T13:53:51.0193310Z dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module PASSED [0.1315s] [ 2%] 2025-12-04T13:53:51.0193963Z 2025-12-04T13:53:51.0194145Z =================================== FAILURES =================================== 
2025-12-04T13:53:51.0194785Z _______ TestGuardSerializationFSDP.test_guard_serialization_fsdp_module ________
2025-12-04T13:53:51.0195430Z Traceback (most recent call last):
2025-12-04T13:53:51.0196339Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1322, in wrapper
2025-12-04T13:53:51.0197154Z     self._join_threads(self.threads, fn)
2025-12-04T13:53:51.0198073Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1463, in _join_threads
2025-12-04T13:53:51.0198951Z     cls._check_return_codes(failed_ranks, timeout, fn)
2025-12-04T13:53:51.0199844Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 1488, in _check_return_codes
2025-12-04T13:53:51.0200650Z     raise RuntimeError(msg)
2025-12-04T13:53:51.0201168Z RuntimeError: Thread 0 terminated or timed out after 300 seconds
2025-12-04T13:53:51.0201543Z
2025-12-04T13:53:51.0201808Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T13:53:51.0202373Z stats [('calls_captured', 4), ('unique_graphs', 2)]
2025-12-04T13:53:51.0202914Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T13:53:51.0203458Z stats [('calls_captured', 4), ('unique_graphs', 2)]
2025-12-04T13:53:51.0204003Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T13:53:51.0204544Z stats [('calls_captured', 4), ('unique_graphs', 2)]
2025-12-04T13:53:51.0205082Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T13:53:51.0205627Z stats [('calls_captured', 4), ('unique_graphs', 2)]
2025-12-04T13:53:51.0206220Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T13:53:51.0206758Z stats [('calls_captured', 4), ('unique_graphs', 2)]
2025-12-04T13:53:51.0207293Z
----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0207825Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0208359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0208891Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0209421Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0209965Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0210569Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0211039Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0211578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0212112Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0212653Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0213185Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0213717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0214252Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0214778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0215313Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0215849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0216435Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0216969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0217495Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0218031Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T13:53:51.0218570Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0219107Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0219641Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0220170Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0220832Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0221384Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0222054Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0222581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0223116Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0223641Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0225985Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0226515Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0227053Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0227579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0228111Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0228638Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0229179Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0229708Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0230245Z stats [('calls_captured', 4), ('unique_graphs', 2)] 2025-12-04T13:53:51.0230771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0231306Z stats [('calls_captured', 4), ('unique_graphs', 2)] 
2025-12-04T13:53:51.0231836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T13:53:51.0232365Z stats [('calls_captured', 2), ('unique_graphs', 1)] 2025-12-04T13:53:51.0233467Z - generated xml file: /var/lib/jenkins/pytorch/test/test-reports/python-pytest/dynamo.test_guard_serialization/dynamo.test_guard_serialization-42fc42d5fcd955ef.xml - 2025-12-04T13:53:51.0234547Z =========================== short test summary info ============================ 2025-12-04T13:53:51.0235675Z FAILED [300.0075s] dynamo/test_guard_serialization.py::TestGuardSerializationFSDP::test_guard_serialization_fsdp_module - RuntimeError: Thread 0 terminated or timed out after 300 seconds 2025-12-04T13:53:51.0236913Z =================== 1 failed, 49 passed in 306.99s (0:05:06) =================== 2025-12-04T13:53:51.0405253Z Prepare all required actions 2025-12-04T13:53:51.0405815Z Getting action download info 2025-12-04T13:53:51.4190527Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T13:53:52.2776825Z Download action repository 'actions/upload-artifact@v4' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02) 2025-12-04T13:53:53.2209764Z ##[group]Run ./.github/actions/upload-test-artifacts 2025-12-04T13:53:53.2210066Z with: 2025-12-04T13:53:53.2210250Z use-gha: true 2025-12-04T13:53:53.2210556Z file-suffix: test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184 2025-12-04T13:53:53.2210906Z s3-bucket: gha-artifacts 2025-12-04T13:53:53.2211123Z env: 2025-12-04T13:53:53.2211308Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:53.2211581Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:53.2211949Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:53.2212314Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:53.2213100Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video 
--group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:53.2213875Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:53.2214113Z AWS_REGION: us-east-1 2025-12-04T13:53:53.2214424Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:53.2214732Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:53.2218969Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:53.2219308Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:53.2219673Z ##[endgroup] 2025-12-04T13:53:53.2272920Z ##[group]Run actions/upload-artifact@v4 2025-12-04T13:53:53.2273042Z with: 2025-12-04T13:53:53.2273222Z name: test-jsons-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip 2025-12-04T13:53:53.2273424Z retention-days: 14 2025-12-04T13:53:53.2273529Z if-no-files-found: warn 2025-12-04T13:53:53.2273635Z path: test/**/*.json 2025-12-04T13:53:53.2273736Z compression-level: 6 2025-12-04T13:53:53.2273834Z overwrite: false 2025-12-04T13:53:53.2273934Z include-hidden-files: false 2025-12-04T13:53:53.2274108Z env: 2025-12-04T13:53:53.2274197Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:53.2274330Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:53.2274502Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:53.2274664Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:53.2275042Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:53.2275406Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:53.2275519Z AWS_REGION: us-east-1 2025-12-04T13:53:53.2275647Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:53.2275792Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:53.2277893Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:53.2278055Z CONTAINER_NAME: 
0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:53.2278231Z ##[endgroup] 2025-12-04T13:53:53.6716438Z With the provided path, there will be 6 files uploaded 2025-12-04T13:53:53.6718610Z Artifact name is valid! 2025-12-04T13:53:53.6719451Z Root directory input is valid! 2025-12-04T13:53:53.9001018Z Beginning upload of artifact content to blob storage 2025-12-04T13:53:54.3129263Z Uploaded bytes 46621 2025-12-04T13:53:54.3893162Z Finished uploading artifact content to blob storage! 2025-12-04T13:53:54.3897650Z SHA256 digest of uploaded artifact zip is aa3043c36389f3bd7b4af1ab573c2e0b0a7feffde415feb8d5a6c08e8cf8f939 2025-12-04T13:53:54.3899716Z Finalizing artifact upload 2025-12-04T13:53:54.5411777Z Artifact test-jsons-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip.zip successfully finalized. Artifact ID 4764768976 2025-12-04T13:53:54.5413435Z Artifact test-jsons-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip has been successfully uploaded! Final size is 46621 bytes. 
Artifact ID is 4764768976 2025-12-04T13:53:54.5421106Z Artifact download URL: https://github.com/pytorch/pytorch/actions/runs/19922849170/artifacts/4764768976 2025-12-04T13:53:54.5619897Z ##[group]Run actions/upload-artifact@v4 2025-12-04T13:53:54.5620334Z with: 2025-12-04T13:53:54.5620943Z name: test-reports-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip 2025-12-04T13:53:54.5621635Z retention-days: 14 2025-12-04T13:53:54.5621995Z if-no-files-found: ignore 2025-12-04T13:53:54.5622383Z path: test/**/*.xml test/**/*.csv 2025-12-04T13:53:54.5622792Z compression-level: 6 2025-12-04T13:53:54.5623134Z overwrite: false 2025-12-04T13:53:54.5623494Z include-hidden-files: false 2025-12-04T13:53:54.5623858Z env: 2025-12-04T13:53:54.5624154Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:54.5624623Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:54.5625213Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:54.5625762Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:54.5627117Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:54.5628333Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:54.5628718Z AWS_REGION: us-east-1 2025-12-04T13:53:54.5629195Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:54.5629697Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:54.5636801Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:54.5637358Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:54.5637948Z ##[endgroup] 2025-12-04T13:53:55.0287832Z With the provided path, there will be 34 files uploaded 2025-12-04T13:53:55.0293519Z Artifact name is valid! 2025-12-04T13:53:55.0294139Z Root directory input is valid! 
2025-12-04T13:53:55.2590756Z Beginning upload of artifact content to blob storage 2025-12-04T13:53:56.9801941Z Uploaded bytes 2945320 2025-12-04T13:53:57.0514708Z Finished uploading artifact content to blob storage! 2025-12-04T13:53:57.0516210Z SHA256 digest of uploaded artifact zip is 12e53ceed6a7c467d28f6f28c252c52fe81f5436baa519449a161a49c4355325 2025-12-04T13:53:57.0518436Z Finalizing artifact upload 2025-12-04T13:53:57.2029874Z Artifact test-reports-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip.zip successfully finalized. Artifact ID 4764769489 2025-12-04T13:53:57.2031584Z Artifact test-reports-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip has been successfully uploaded! Final size is 2945320 bytes. Artifact ID is 4764769489 2025-12-04T13:53:57.2040292Z Artifact download URL: https://github.com/pytorch/pytorch/actions/runs/19922849170/artifacts/4764769489 2025-12-04T13:53:57.2306668Z ##[group]Run actions/upload-artifact@v4 2025-12-04T13:53:57.2307125Z with: 2025-12-04T13:53:57.2307707Z name: logs-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip 2025-12-04T13:53:57.2308377Z retention-days: 14 2025-12-04T13:53:57.2308749Z if-no-files-found: ignore 2025-12-04T13:53:57.2309154Z path: usage_log.txt test/**/*.log 2025-12-04T13:53:57.2309574Z compression-level: 6 2025-12-04T13:53:57.2309930Z overwrite: false 2025-12-04T13:53:57.2310290Z include-hidden-files: false 2025-12-04T13:53:57.2310671Z env: 2025-12-04T13:53:57.2310980Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:57.2311453Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:57.2312065Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:57.2312634Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:57.2314277Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin 
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:57.2315526Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:57.2316442Z AWS_REGION: us-east-1 2025-12-04T13:53:57.2316934Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:57.2317465Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:57.2324252Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:57.2324827Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:57.2325429Z ##[endgroup] 2025-12-04T13:53:57.6747654Z Multiple search paths detected. Calculating the least common ancestor of all paths 2025-12-04T13:53:57.6749079Z The least common ancestor is /home/runner/_work/pytorch/pytorch. This will be the root directory of the artifact 2025-12-04T13:53:57.6749937Z With the provided path, there will be 32 files uploaded 2025-12-04T13:53:57.6750651Z Artifact name is valid! 2025-12-04T13:53:57.6751047Z Root directory input is valid! 2025-12-04T13:53:57.9263846Z Beginning upload of artifact content to blob storage 2025-12-04T13:53:58.7628386Z Uploaded bytes 628509 2025-12-04T13:53:58.8331721Z Finished uploading artifact content to blob storage! 2025-12-04T13:53:58.8335453Z SHA256 digest of uploaded artifact zip is a32d99b65ef83e0d59a2cd0712c8c35929b8fee7eceffed116597f8d51bfcc75 2025-12-04T13:53:58.8338346Z Finalizing artifact upload 2025-12-04T13:53:58.9836212Z Artifact logs-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip.zip successfully finalized. Artifact ID 4764769830 2025-12-04T13:53:58.9837770Z Artifact logs-runattempt1-test-default-6-6-linux.rocm.gpu.gfx942.1.b_57116213184.zip has been successfully uploaded! Final size is 628509 bytes. 
Artifact ID is 4764769830 2025-12-04T13:53:58.9847738Z Artifact download URL: https://github.com/pytorch/pytorch/actions/runs/19922849170/artifacts/4764769830 2025-12-04T13:53:59.0099420Z ##[group]Run # shellcheck disable=SC2156 2025-12-04T13:53:59.0099986Z # shellcheck disable=SC2156 2025-12-04T13:53:59.0100748Z find . -iname "core.[1-9]*" -exec docker exec "${CONTAINER_NAME}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2025-12-04T13:53:59.0110512Z shell: /usr/bin/bash -e {0} 2025-12-04T13:53:59.0111045Z env: 2025-12-04T13:53:59.0112070Z GIT_DEFAULT_BRANCH: main 2025-12-04T13:53:59.0112676Z RUNNER_ARTIFACT_DIR: /home/runner/_work/_temp/artifacts 2025-12-04T13:53:59.0113538Z RUNNER_TEST_RESULTS_DIR: /home/runner/_work/_temp/test-results 2025-12-04T13:53:59.0114249Z RUNNER_DOCS_DIR: /home/runner/_work/_temp/docs 2025-12-04T13:53:59.0115716Z GPU_FLAG: --device=/dev/mem --device=/dev/kfd --group-add 110 --device /dev/dri/renderD144 --group-add video --group-add 109 --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host 2025-12-04T13:53:59.0117256Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T13:53:59.0117897Z AWS_REGION: us-east-1 2025-12-04T13:53:59.0118616Z AWS_ACCESS_KEY_ID: *** 2025-12-04T13:53:59.0119285Z AWS_SECRET_ACCESS_KEY: *** 2025-12-04T13:53:59.0126245Z AWS_SESSION_TOKEN: *** 2025-12-04T13:53:59.0126940Z CONTAINER_NAME: 0c85769212a4e0226d2955c8e3682214ae37283d407cbcdd8311c48970b0a4a7 2025-12-04T13:53:59.0127726Z ##[endgroup] 2025-12-04T13:53:59.1647775Z Post job cleanup. 2025-12-04T13:53:59.1685682Z Post job cleanup. 2025-12-04T13:53:59.1904159Z Logging out of registry 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T13:53:59.2135715Z Post job cleanup. 2025-12-04T13:53:59.2841040Z Post job cleanup. 2025-12-04T13:53:59.2902915Z Post job cleanup. 
2025-12-04T13:53:59.3387963Z [command]/usr/bin/git version 2025-12-04T13:53:59.3421595Z git version 2.52.0 2025-12-04T13:53:59.3448763Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/6a7d8fe5-d21c-43b6-928d-6f692b89efca/.gitconfig' 2025-12-04T13:53:59.3455210Z Temporarily overriding HOME='/home/runner/_work/_temp/6a7d8fe5-d21c-43b6-928d-6f692b89efca' before making global git config changes 2025-12-04T13:53:59.3456368Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T13:53:59.3457762Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch 2025-12-04T13:53:59.3487645Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T13:53:59.3505346Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T13:53:59.3741997Z Entering 'android/libs/fbjni' 2025-12-04T13:53:59.3779587Z Entering 'third_party/FP16' 2025-12-04T13:53:59.3805663Z Entering 'third_party/FXdiv' 2025-12-04T13:53:59.3838669Z Entering 'third_party/NNPACK' 2025-12-04T13:53:59.3909214Z Entering 'third_party/NVTX' 2025-12-04T13:53:59.3967638Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T13:53:59.4008522Z Entering 'third_party/XNNPACK' 2025-12-04T13:53:59.4039705Z Entering 'third_party/aiter' 2025-12-04T13:53:59.4075624Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T13:53:59.4110978Z Entering 'third_party/benchmark' 2025-12-04T13:53:59.4144066Z Entering 'third_party/composable_kernel' 2025-12-04T13:53:59.4182068Z Entering 'third_party/cpp-httplib' 2025-12-04T13:53:59.4203797Z Entering 'third_party/cpuinfo' 2025-12-04T13:53:59.4228719Z Entering 'third_party/cudnn_frontend' 2025-12-04T13:53:59.4256978Z Entering 'third_party/cutlass' 2025-12-04T13:53:59.4282510Z Entering 'third_party/fbgemm' 2025-12-04T13:53:59.4319190Z 
Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T13:53:59.4357188Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T13:53:59.4403966Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T13:53:59.4453797Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T13:53:59.4485896Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T13:53:59.4512979Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T13:53:59.4532807Z Entering 'third_party/fbgemm/external/json' 2025-12-04T13:53:59.4565647Z Entering 'third_party/flash-attention' 2025-12-04T13:53:59.4591557Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T13:53:59.4652961Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T13:53:59.4707352Z Entering 'third_party/flatbuffers' 2025-12-04T13:53:59.4764170Z Entering 'third_party/fmt' 2025-12-04T13:53:59.4809031Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T13:53:59.4839616Z Entering 'third_party/gloo' 2025-12-04T13:53:59.4876763Z Entering 'third_party/googletest' 2025-12-04T13:53:59.4899798Z Entering 'third_party/ideep' 2025-12-04T13:53:59.4933327Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T13:53:59.4964880Z Entering 'third_party/ittapi' 2025-12-04T13:53:59.4987357Z Entering 'third_party/kineto' 2025-12-04T13:53:59.5011466Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T13:53:59.5035074Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T13:53:59.5057299Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T13:53:59.5077731Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T13:53:59.5117584Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T13:53:59.5172728Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T13:53:59.5218060Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T13:53:59.5265651Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T13:53:59.5301936Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T13:53:59.5342567Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T13:53:59.5397030Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T13:53:59.5430523Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:53:59.5478824Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:53:59.5507273Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T13:53:59.5566429Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T13:53:59.5616366Z Entering 'third_party/kleidiai' 2025-12-04T13:53:59.5644309Z Entering 'third_party/mimalloc' 2025-12-04T13:53:59.5679628Z Entering 'third_party/nlohmann' 2025-12-04T13:53:59.5720095Z Entering 'third_party/onnx' 2025-12-04T13:53:59.5753449Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T13:53:59.5784179Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T13:53:59.5809297Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T13:53:59.5836361Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T13:53:59.5878406Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T13:53:59.5939280Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T13:53:59.5987305Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T13:53:59.6036218Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T13:53:59.6071553Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T13:53:59.6113487Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:53:59.6141010Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:53:59.6193061Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T13:53:59.6228206Z Entering 'third_party/pocketfft' 2025-12-04T13:53:59.6271496Z Entering 'third_party/protobuf' 2025-12-04T13:53:59.6300082Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T13:53:59.6343373Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T13:53:59.6387298Z Entering 'third_party/psimd' 2025-12-04T13:53:59.6409154Z Entering 'third_party/pthreadpool' 2025-12-04T13:53:59.6430223Z Entering 'third_party/pybind11' 2025-12-04T13:53:59.6451289Z Entering 'third_party/python-peachpy' 2025-12-04T13:53:59.6476553Z Entering 'third_party/sleef' 2025-12-04T13:53:59.6502306Z Entering 'third_party/tensorpipe' 2025-12-04T13:53:59.6525813Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T13:53:59.6561592Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T13:53:59.6592333Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T13:53:59.6630962Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T13:53:59.6666826Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T13:53:59.6736467Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T13:53:59.6759045Z http.https://github.com/.extraheader 2025-12-04T13:53:59.6771868Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-12-04T13:53:59.6806657Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 
'http.https://github.com/.extraheader' || :" 2025-12-04T13:53:59.7023438Z Entering 'android/libs/fbjni' 2025-12-04T13:53:59.7024043Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7052317Z Entering 'third_party/FP16' 2025-12-04T13:53:59.7071043Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7102281Z Entering 'third_party/FXdiv' 2025-12-04T13:53:59.7122109Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7145490Z Entering 'third_party/NNPACK' 2025-12-04T13:53:59.7185018Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7224395Z Entering 'third_party/NVTX' 2025-12-04T13:53:59.7251444Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7280599Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T13:53:59.7304737Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7335827Z Entering 'third_party/XNNPACK' 2025-12-04T13:53:59.7363781Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7388856Z Entering 'third_party/aiter' 2025-12-04T13:53:59.7408252Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7439402Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T13:53:59.7456976Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7488156Z Entering 'third_party/benchmark' 2025-12-04T13:53:59.7507855Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7535766Z Entering 'third_party/composable_kernel' 2025-12-04T13:53:59.7561740Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7596326Z Entering 'third_party/cpp-httplib' 2025-12-04T13:53:59.7609597Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7636600Z Entering 'third_party/cpuinfo' 2025-12-04T13:53:59.7655130Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7685475Z Entering 'third_party/cudnn_frontend' 2025-12-04T13:53:59.7698885Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7727669Z Entering 'third_party/cutlass' 2025-12-04T13:53:59.7758370Z http.https://github.com/.extraheader 
2025-12-04T13:53:59.7790966Z Entering 'third_party/fbgemm' 2025-12-04T13:53:59.7817079Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7845997Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T13:53:59.7857550Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7895066Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T13:53:59.7912294Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7932636Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T13:53:59.7947125Z http.https://github.com/.extraheader 2025-12-04T13:53:59.7965189Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T13:53:59.7979467Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8002133Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T13:53:59.8025131Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8053803Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T13:53:59.8073001Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8094667Z Entering 'third_party/fbgemm/external/json' 2025-12-04T13:53:59.8128755Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8155384Z Entering 'third_party/flash-attention' 2025-12-04T13:53:59.8169419Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8196638Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T13:53:59.8214147Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8234876Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T13:53:59.8259638Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8281726Z Entering 'third_party/flatbuffers' 2025-12-04T13:53:59.8299485Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8317118Z Entering 'third_party/fmt' 2025-12-04T13:53:59.8330963Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8357451Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T13:53:59.8370296Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8407215Z Entering 
'third_party/gloo' 2025-12-04T13:53:59.8432034Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8460411Z Entering 'third_party/googletest' 2025-12-04T13:53:59.8482136Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8499939Z Entering 'third_party/ideep' 2025-12-04T13:53:59.8521772Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8538552Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T13:53:59.8556347Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8579487Z Entering 'third_party/ittapi' 2025-12-04T13:53:59.8598855Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8626049Z Entering 'third_party/kineto' 2025-12-04T13:53:59.8652893Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8680625Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T13:53:59.8712617Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8728581Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T13:53:59.8744941Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8772250Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T13:53:59.8784567Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8801018Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T13:53:59.8811612Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8827908Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T13:53:59.8838423Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8853313Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T13:53:59.8866669Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8897010Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T13:53:59.8909587Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8924625Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T13:53:59.8936372Z http.https://github.com/.extraheader 2025-12-04T13:53:59.8977041Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T13:53:59.9002301Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9032563Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T13:53:59.9059553Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9076541Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T13:53:59.9088450Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9104896Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:53:59.9120350Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9147083Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:53:59.9164457Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9183446Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T13:53:59.9211917Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9230419Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T13:53:59.9242536Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9271544Z Entering 'third_party/kleidiai' 2025-12-04T13:53:59.9287533Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9316161Z Entering 'third_party/mimalloc' 2025-12-04T13:53:59.9329544Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9357079Z Entering 'third_party/nlohmann' 2025-12-04T13:53:59.9375316Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9404877Z Entering 'third_party/onnx' 2025-12-04T13:53:59.9423157Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9445711Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T13:53:59.9461681Z 
http.https://github.com/.extraheader 2025-12-04T13:53:59.9502644Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T13:53:59.9521161Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9538120Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T13:53:59.9564818Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9581349Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T13:53:59.9594380Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9609581Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T13:53:59.9622887Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9637663Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T13:53:59.9649256Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9672173Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T13:53:59.9691913Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9706924Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T13:53:59.9718147Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9733077Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T13:53:59.9750582Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9767483Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:53:59.9789396Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9818908Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:53:59.9855079Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9887132Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T13:53:59.9905700Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9930789Z Entering 'third_party/pocketfft' 2025-12-04T13:53:59.9964715Z http.https://github.com/.extraheader 2025-12-04T13:53:59.9988050Z Entering 
'third_party/protobuf' 2025-12-04T13:54:00.0021062Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0045348Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T13:54:00.0066538Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0096051Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T13:54:00.0123698Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0143108Z Entering 'third_party/psimd' 2025-12-04T13:54:00.0162951Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0181119Z Entering 'third_party/pthreadpool' 2025-12-04T13:54:00.0205130Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0224599Z Entering 'third_party/pybind11' 2025-12-04T13:54:00.0238156Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0254436Z Entering 'third_party/python-peachpy' 2025-12-04T13:54:00.0278759Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0302943Z Entering 'third_party/sleef' 2025-12-04T13:54:00.0331118Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0351667Z Entering 'third_party/tensorpipe' 2025-12-04T13:54:00.0392181Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0429936Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T13:54:00.0458699Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0494117Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T13:54:00.0522084Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0555287Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T13:54:00.0577131Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0606766Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T13:54:00.0629508Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0659638Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T13:54:00.0690652Z http.https://github.com/.extraheader 2025-12-04T13:54:00.0755911Z [command]/usr/bin/git config --local --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.0804217Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T13:54:00.1073364Z Entering 'android/libs/fbjni' 2025-12-04T13:54:00.1101036Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T13:54:00.1123053Z Entering 'third_party/FP16' 2025-12-04T13:54:00.1139661Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T13:54:00.1156402Z Entering 'third_party/FXdiv' 2025-12-04T13:54:00.1181991Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T13:54:00.1191290Z Entering 'third_party/NNPACK' 2025-12-04T13:54:00.1212373Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T13:54:00.1222058Z Entering 'third_party/NVTX' 2025-12-04T13:54:00.1232935Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T13:54:00.1251464Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T13:54:00.1275298Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T13:54:00.1295620Z Entering 'third_party/XNNPACK' 2025-12-04T13:54:00.1307432Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T13:54:00.1323002Z Entering 'third_party/aiter' 2025-12-04T13:54:00.1340722Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T13:54:00.1350902Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T13:54:00.1371868Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T13:54:00.1385919Z Entering 
'third_party/benchmark' 2025-12-04T13:54:00.1396988Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T13:54:00.1406126Z Entering 'third_party/composable_kernel' 2025-12-04T13:54:00.1416796Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T13:54:00.1436178Z Entering 'third_party/cpp-httplib' 2025-12-04T13:54:00.1447487Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T13:54:00.1467047Z Entering 'third_party/cpuinfo' 2025-12-04T13:54:00.1486668Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T13:54:00.1496407Z Entering 'third_party/cudnn_frontend' 2025-12-04T13:54:00.1506874Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T13:54:00.1527513Z Entering 'third_party/cutlass' 2025-12-04T13:54:00.1544314Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T13:54:00.1558669Z Entering 'third_party/fbgemm' 2025-12-04T13:54:00.1569326Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T13:54:00.1579610Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T13:54:00.1606225Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T13:54:00.1623990Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T13:54:00.1648022Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T13:54:00.1676038Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T13:54:00.1687149Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config 
remote.origin.url 2025-12-04T13:54:00.1707532Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T13:54:00.1725713Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T13:54:00.1738923Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T13:54:00.1750687Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T13:54:00.1769014Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T13:54:00.1779424Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T13:54:00.1787705Z Entering 'third_party/fbgemm/external/json' 2025-12-04T13:54:00.1797628Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T13:54:00.1809769Z Entering 'third_party/flash-attention' 2025-12-04T13:54:00.1825659Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T13:54:00.1836023Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T13:54:00.1846283Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T13:54:00.1871752Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T13:54:00.1899286Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T13:54:00.1915151Z Entering 'third_party/flatbuffers' 2025-12-04T13:54:00.1925727Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T13:54:00.1936885Z Entering 'third_party/fmt' 2025-12-04T13:54:00.1952173Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config 
remote.origin.url 2025-12-04T13:54:00.1961547Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T13:54:00.1971794Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T13:54:00.1992109Z Entering 'third_party/gloo' 2025-12-04T13:54:00.2003218Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T13:54:00.2012567Z Entering 'third_party/googletest' 2025-12-04T13:54:00.2022496Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:00.2030744Z Entering 'third_party/ideep' 2025-12-04T13:54:00.2040645Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T13:54:00.2050045Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T13:54:00.2063080Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T13:54:00.2076258Z Entering 'third_party/ittapi' 2025-12-04T13:54:00.2092019Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T13:54:00.2101269Z Entering 'third_party/kineto' 2025-12-04T13:54:00.2111615Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T13:54:00.2121111Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T13:54:00.2141025Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T13:54:00.2148817Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T13:54:00.2165045Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T13:54:00.2173683Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T13:54:00.2184581Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T13:54:00.2191875Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T13:54:00.2200669Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T13:54:00.2208765Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T13:54:00.2219268Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T13:54:00.2237794Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T13:54:00.2264022Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T13:54:00.2283529Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T13:54:00.2294457Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T13:54:00.2311313Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T13:54:00.2324782Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:00.2343468Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T13:54:00.2353985Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T13:54:00.2363123Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T13:54:00.2372127Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T13:54:00.2379959Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T13:54:00.2389404Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T13:54:00.2399667Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:00.2413545Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T13:54:00.2423581Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:00.2439821Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T13:54:00.2463943Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T13:54:00.2495335Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T13:54:00.2504001Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T13:54:00.2533291Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 
2025-12-04T13:54:00.2544741Z Entering 'third_party/kleidiai' 2025-12-04T13:54:00.2555061Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T13:54:00.2564858Z Entering 'third_party/mimalloc' 2025-12-04T13:54:00.2583464Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T13:54:00.2592332Z Entering 'third_party/nlohmann' 2025-12-04T13:54:00.2608708Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T13:54:00.2618513Z Entering 'third_party/onnx' 2025-12-04T13:54:00.2629075Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T13:54:00.2645900Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T13:54:00.2663390Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T13:54:00.2684917Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T13:54:00.2697318Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T13:54:00.2705633Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T13:54:00.2724963Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T13:54:00.2733719Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T13:54:00.2763455Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:00.2773077Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T13:54:00.2797798Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T13:54:00.2807671Z Entering 
'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T13:54:00.2832329Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T13:54:00.2844109Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T13:54:00.2854610Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T13:54:00.2862825Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T13:54:00.2888340Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T13:54:00.2898296Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T13:54:00.2911162Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T13:54:00.2919615Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:00.2941617Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T13:54:00.2952124Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:00.2978744Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T13:54:00.3001988Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T13:54:00.3021016Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T13:54:00.3050746Z Entering 'third_party/pocketfft' 2025-12-04T13:54:00.3075875Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T13:54:00.3085845Z Entering 'third_party/protobuf' 2025-12-04T13:54:00.3102251Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T13:54:00.3112812Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T13:54:00.3140286Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T13:54:00.3150540Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T13:54:00.3162284Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:00.3175162Z Entering 'third_party/psimd' 2025-12-04T13:54:00.3191123Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T13:54:00.3201003Z Entering 'third_party/pthreadpool' 2025-12-04T13:54:00.3211875Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T13:54:00.3221685Z Entering 'third_party/pybind11' 2025-12-04T13:54:00.3231898Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T13:54:00.3251760Z Entering 'third_party/python-peachpy' 2025-12-04T13:54:00.3262924Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T13:54:00.3271712Z Entering 'third_party/sleef' 2025-12-04T13:54:00.3287178Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T13:54:00.3307718Z Entering 'third_party/tensorpipe' 2025-12-04T13:54:00.3322838Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T13:54:00.3332434Z Entering 
'third_party/tensorpipe/third_party/googletest' 2025-12-04T13:54:00.3357621Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:00.3378590Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T13:54:00.3389306Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T13:54:00.3396766Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T13:54:00.3422774Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T13:54:00.3442926Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T13:54:00.3462237Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T13:54:00.3471165Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T13:54:00.3480729Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T13:54:00.3516859Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3553262Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3586199Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3611970Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3648217Z 
[command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3671958Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3709187Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3742008Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3772318Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3795188Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3830299Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3864933Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3899391Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3933340Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3956857Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.3989606Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4012218Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4041850Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4072137Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4095450Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4130896Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4153859Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4176126Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4210973Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T13:54:00.4234788Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4272114Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4295235Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4331575Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4368149Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4400314Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4422991Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4458287Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4481582Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4515254Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4538406Z [command]/usr/bin/git config 
--file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4563177Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4597813Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4635659Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4659466Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4693647Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4732267Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4770663Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4815232Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4855347Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4895356Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4930746Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4963381Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.4999958Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5023792Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5060296Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5094316Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5119593Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5142097Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5165271Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5188624Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5214964Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5240193Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5264680Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5289307Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5314153Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T13:54:00.5339622Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5380749Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5417709Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5442008Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5482323Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5516678Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5541909Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5578437Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5615674Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T13:54:00.5650005Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5681741Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5704876Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5738813Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5764120Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5787413Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5824323Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5858820Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5882089Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5906154Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp 
^includeIf\.gitdir: 2025-12-04T13:54:00.5929843Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.5952998Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:00.6151617Z Post job cleanup. 2025-12-04T13:54:00.6621419Z [command]/usr/bin/git version 2025-12-04T13:54:00.6658692Z git version 2.52.0 2025-12-04T13:54:00.6686842Z Copying '/home/runner/.gitconfig' to '/home/runner/_work/_temp/c052769c-b643-4d81-9739-7bed55ce0b37/.gitconfig' 2025-12-04T13:54:00.6694658Z Temporarily overriding HOME='/home/runner/_work/_temp/c052769c-b643-4d81-9739-7bed55ce0b37' before making global git config changes 2025-12-04T13:54:00.6695739Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T13:54:00.6697090Z [command]/usr/bin/git config --global --add safe.directory /home/runner/_work/pytorch/pytorch 2025-12-04T13:54:00.6726193Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T13:54:00.6756040Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T13:54:00.6994317Z Entering 'android/libs/fbjni' 2025-12-04T13:54:00.7017380Z Entering 'third_party/FP16' 2025-12-04T13:54:00.7050703Z Entering 'third_party/FXdiv' 2025-12-04T13:54:00.7083074Z Entering 'third_party/NNPACK' 2025-12-04T13:54:00.7105505Z Entering 'third_party/NVTX' 2025-12-04T13:54:00.7126716Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T13:54:00.7150298Z Entering 'third_party/XNNPACK' 2025-12-04T13:54:00.7182421Z Entering 'third_party/aiter' 2025-12-04T13:54:00.7204222Z 
Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T13:54:00.7234772Z Entering 'third_party/benchmark' 2025-12-04T13:54:00.7257949Z Entering 'third_party/composable_kernel' 2025-12-04T13:54:00.7282216Z Entering 'third_party/cpp-httplib' 2025-12-04T13:54:00.7302847Z Entering 'third_party/cpuinfo' 2025-12-04T13:54:00.7323155Z Entering 'third_party/cudnn_frontend' 2025-12-04T13:54:00.7350124Z Entering 'third_party/cutlass' 2025-12-04T13:54:00.7377160Z Entering 'third_party/fbgemm' 2025-12-04T13:54:00.7402341Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T13:54:00.7432809Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T13:54:00.7463071Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T13:54:00.7490288Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T13:54:00.7514923Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T13:54:00.7541338Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T13:54:00.7598128Z Entering 'third_party/fbgemm/external/json' 2025-12-04T13:54:00.7655538Z Entering 'third_party/flash-attention' 2025-12-04T13:54:00.7689283Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T13:54:00.7722588Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T13:54:00.7773508Z Entering 'third_party/flatbuffers' 2025-12-04T13:54:00.7803318Z Entering 'third_party/fmt' 2025-12-04T13:54:00.7834844Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T13:54:00.7868419Z Entering 'third_party/gloo' 2025-12-04T13:54:00.7891729Z Entering 'third_party/googletest' 2025-12-04T13:54:00.7914508Z Entering 'third_party/ideep' 2025-12-04T13:54:00.7937924Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T13:54:00.7971721Z Entering 'third_party/ittapi' 2025-12-04T13:54:00.7994350Z Entering 'third_party/kineto' 2025-12-04T13:54:00.8029346Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T13:54:00.8078500Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T13:54:00.8117056Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T13:54:00.8138854Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T13:54:00.8190724Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T13:54:00.8247332Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T13:54:00.8285797Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T13:54:00.8337468Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T13:54:00.8372374Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T13:54:00.8417343Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T13:54:00.8458626Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T13:54:00.8491706Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:00.8521124Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:00.8547859Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T13:54:00.8570793Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T13:54:00.8599628Z Entering 'third_party/kleidiai' 2025-12-04T13:54:00.8622297Z Entering 'third_party/mimalloc' 2025-12-04T13:54:00.8647262Z Entering 'third_party/nlohmann' 2025-12-04T13:54:00.8670158Z Entering 'third_party/onnx' 2025-12-04T13:54:00.8719185Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T13:54:00.8748427Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T13:54:00.8772338Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T13:54:00.8795352Z 
Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T13:54:00.8846003Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T13:54:00.8888715Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T13:54:00.8932343Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T13:54:00.8955712Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T13:54:00.8995543Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T13:54:00.9032662Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:00.9078862Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:00.9124104Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T13:54:00.9168418Z Entering 'third_party/pocketfft' 2025-12-04T13:54:00.9211972Z Entering 'third_party/protobuf' 2025-12-04T13:54:00.9250198Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T13:54:00.9285796Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T13:54:00.9336402Z Entering 'third_party/psimd' 2025-12-04T13:54:00.9365562Z Entering 'third_party/pthreadpool' 2025-12-04T13:54:00.9396849Z Entering 'third_party/pybind11' 2025-12-04T13:54:00.9430365Z Entering 'third_party/python-peachpy' 2025-12-04T13:54:00.9451984Z Entering 'third_party/sleef' 2025-12-04T13:54:00.9484229Z Entering 'third_party/tensorpipe' 2025-12-04T13:54:00.9508858Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T13:54:00.9530764Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T13:54:00.9551406Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T13:54:00.9570431Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T13:54:00.9589554Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T13:54:00.9639225Z 
[command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T13:54:00.9669614Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-12-04T13:54:00.9873277Z Entering 'android/libs/fbjni' 2025-12-04T13:54:00.9897201Z Entering 'third_party/FP16' 2025-12-04T13:54:00.9921776Z Entering 'third_party/FXdiv' 2025-12-04T13:54:00.9944816Z Entering 'third_party/NNPACK' 2025-12-04T13:54:00.9980956Z Entering 'third_party/NVTX' 2025-12-04T13:54:01.0006127Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T13:54:01.0027537Z Entering 'third_party/XNNPACK' 2025-12-04T13:54:01.0064224Z Entering 'third_party/aiter' 2025-12-04T13:54:01.0099040Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T13:54:01.0124474Z Entering 'third_party/benchmark' 2025-12-04T13:54:01.0148074Z Entering 'third_party/composable_kernel' 2025-12-04T13:54:01.0181825Z Entering 'third_party/cpp-httplib' 2025-12-04T13:54:01.0212343Z Entering 'third_party/cpuinfo' 2025-12-04T13:54:01.0234830Z Entering 'third_party/cudnn_frontend' 2025-12-04T13:54:01.0255866Z Entering 'third_party/cutlass' 2025-12-04T13:54:01.0283833Z Entering 'third_party/fbgemm' 2025-12-04T13:54:01.0312649Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T13:54:01.0357719Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T13:54:01.0403062Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T13:54:01.0427491Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T13:54:01.0456438Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T13:54:01.0481231Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T13:54:01.0523804Z Entering 'third_party/fbgemm/external/json' 2025-12-04T13:54:01.0551784Z Entering 'third_party/flash-attention' 
2025-12-04T13:54:01.0574656Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T13:54:01.0598181Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T13:54:01.0660155Z Entering 'third_party/flatbuffers' 2025-12-04T13:54:01.0685919Z Entering 'third_party/fmt' 2025-12-04T13:54:01.0710355Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T13:54:01.0745188Z Entering 'third_party/gloo' 2025-12-04T13:54:01.0770615Z Entering 'third_party/googletest' 2025-12-04T13:54:01.0805030Z Entering 'third_party/ideep' 2025-12-04T13:54:01.0843703Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T13:54:01.0893944Z Entering 'third_party/ittapi' 2025-12-04T13:54:01.0917936Z Entering 'third_party/kineto' 2025-12-04T13:54:01.0942819Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T13:54:01.0980776Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T13:54:01.1025775Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T13:54:01.1051632Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T13:54:01.1097366Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T13:54:01.1148850Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T13:54:01.1186402Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T13:54:01.1232365Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T13:54:01.1264471Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T13:54:01.1315053Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T13:54:01.1358026Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T13:54:01.1399034Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:01.1436160Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:01.1484214Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T13:54:01.1510457Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T13:54:01.1551963Z Entering 'third_party/kleidiai' 2025-12-04T13:54:01.1597857Z Entering 'third_party/mimalloc' 2025-12-04T13:54:01.1642913Z Entering 'third_party/nlohmann' 2025-12-04T13:54:01.1700613Z Entering 'third_party/onnx' 2025-12-04T13:54:01.1742111Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T13:54:01.1781569Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T13:54:01.1806113Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T13:54:01.1837977Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T13:54:01.1877831Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T13:54:01.1917672Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T13:54:01.1959604Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T13:54:01.1990393Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T13:54:01.2022495Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T13:54:01.2078149Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:01.2102711Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:01.2126355Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T13:54:01.2165430Z Entering 'third_party/pocketfft' 2025-12-04T13:54:01.2199342Z Entering 'third_party/protobuf' 2025-12-04T13:54:01.2233242Z Entering 'third_party/protobuf/third_party/benchmark' 
2025-12-04T13:54:01.2270936Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T13:54:01.2313087Z Entering 'third_party/psimd' 2025-12-04T13:54:01.2343082Z Entering 'third_party/pthreadpool' 2025-12-04T13:54:01.2369515Z Entering 'third_party/pybind11' 2025-12-04T13:54:01.2392603Z Entering 'third_party/python-peachpy' 2025-12-04T13:54:01.2414946Z Entering 'third_party/sleef' 2025-12-04T13:54:01.2435981Z Entering 'third_party/tensorpipe' 2025-12-04T13:54:01.2469493Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T13:54:01.2494886Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T13:54:01.2520102Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T13:54:01.2572056Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T13:54:01.2612580Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T13:54:01.2666706Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.2684647Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T13:54:01.2889091Z Entering 'android/libs/fbjni' 2025-12-04T13:54:01.2899607Z file:/home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T13:54:01.2909300Z Entering 'third_party/FP16' 2025-12-04T13:54:01.2919553Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T13:54:01.2928699Z Entering 'third_party/FXdiv' 2025-12-04T13:54:01.2945423Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T13:54:01.2954558Z Entering 'third_party/NNPACK' 2025-12-04T13:54:01.2965153Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T13:54:01.2974801Z Entering 'third_party/NVTX' 2025-12-04T13:54:01.2985324Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T13:54:01.2994680Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T13:54:01.3004682Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T13:54:01.3013902Z Entering 'third_party/XNNPACK' 2025-12-04T13:54:01.3024111Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T13:54:01.3039333Z Entering 'third_party/aiter' 2025-12-04T13:54:01.3049314Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T13:54:01.3059103Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T13:54:01.3067735Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T13:54:01.3080204Z Entering 'third_party/benchmark' 2025-12-04T13:54:01.3109232Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T13:54:01.3118766Z Entering 'third_party/composable_kernel' 2025-12-04T13:54:01.3129847Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T13:54:01.3141970Z Entering 'third_party/cpp-httplib' 2025-12-04T13:54:01.3159268Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T13:54:01.3178701Z Entering 'third_party/cpuinfo' 2025-12-04T13:54:01.3189548Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T13:54:01.3198523Z Entering 'third_party/cudnn_frontend' 2025-12-04T13:54:01.3208869Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T13:54:01.3217785Z Entering 'third_party/cutlass' 2025-12-04T13:54:01.3227699Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T13:54:01.3239964Z Entering 'third_party/fbgemm' 2025-12-04T13:54:01.3249965Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T13:54:01.3260958Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T13:54:01.3284942Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T13:54:01.3293696Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T13:54:01.3316151Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T13:54:01.3341904Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T13:54:01.3352276Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T13:54:01.3371446Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T13:54:01.3381011Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T13:54:01.3408235Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T13:54:01.3418238Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T13:54:01.3436068Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T13:54:01.3450602Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T13:54:01.3469374Z Entering 'third_party/fbgemm/external/json' 2025-12-04T13:54:01.3494535Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T13:54:01.3514636Z Entering 'third_party/flash-attention' 2025-12-04T13:54:01.3525295Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T13:54:01.3547822Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T13:54:01.3569482Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T13:54:01.3580203Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T13:54:01.3608693Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T13:54:01.3635338Z Entering 'third_party/flatbuffers' 2025-12-04T13:54:01.3652158Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T13:54:01.3663028Z Entering 'third_party/fmt' 2025-12-04T13:54:01.3673644Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T13:54:01.3682993Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T13:54:01.3699501Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T13:54:01.3719548Z Entering 'third_party/gloo' 2025-12-04T13:54:01.3736313Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T13:54:01.3745795Z Entering 'third_party/googletest' 2025-12-04T13:54:01.3761841Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:01.3782627Z Entering 'third_party/ideep' 2025-12-04T13:54:01.3794141Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T13:54:01.3803498Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T13:54:01.3824615Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T13:54:01.3838295Z Entering 
'third_party/ittapi' 2025-12-04T13:54:01.3848134Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T13:54:01.3856844Z Entering 'third_party/kineto' 2025-12-04T13:54:01.3866445Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T13:54:01.3875088Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T13:54:01.3896772Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T13:54:01.3904198Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T13:54:01.3918118Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T13:54:01.3935691Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T13:54:01.3956261Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T13:54:01.3966164Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T13:54:01.3985016Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T13:54:01.3994466Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T13:54:01.4004290Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T13:54:01.4011994Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T13:54:01.4023730Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T13:54:01.4034528Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T13:54:01.4056800Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T13:54:01.4066714Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T13:54:01.4086560Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:01.4105493Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T13:54:01.4115283Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T13:54:01.4123705Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T13:54:01.4133769Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T13:54:01.4141819Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T13:54:01.4165246Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T13:54:01.4173995Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:01.4197526Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T13:54:01.4219881Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:01.4230042Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T13:54:01.4241383Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T13:54:01.4263738Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T13:54:01.4282003Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T13:54:01.4301731Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T13:54:01.4312587Z Entering 'third_party/kleidiai' 2025-12-04T13:54:01.4322613Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T13:54:01.4334432Z Entering 'third_party/mimalloc' 2025-12-04T13:54:01.4354717Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T13:54:01.4366011Z Entering 'third_party/nlohmann' 2025-12-04T13:54:01.4377489Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T13:54:01.4388274Z Entering 'third_party/onnx' 2025-12-04T13:54:01.4408166Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T13:54:01.4439223Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T13:54:01.4458101Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T13:54:01.4475017Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T13:54:01.4486023Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T13:54:01.4508115Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T13:54:01.4518543Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T13:54:01.4538112Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T13:54:01.4565171Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:01.4575560Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T13:54:01.4585021Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T13:54:01.4593332Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T13:54:01.4620085Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T13:54:01.4640875Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T13:54:01.4668245Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T13:54:01.4679932Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T13:54:01.4701305Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T13:54:01.4717845Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T13:54:01.4743177Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T13:54:01.4752773Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T13:54:01.4775303Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T13:54:01.4786536Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T13:54:01.4817887Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T13:54:01.4829825Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T13:54:01.4841084Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T13:54:01.4875979Z Entering 'third_party/pocketfft' 2025-12-04T13:54:01.4895647Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T13:54:01.4908561Z Entering 'third_party/protobuf' 2025-12-04T13:54:01.4943240Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T13:54:01.4967099Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T13:54:01.4986166Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T13:54:01.4995399Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T13:54:01.5012685Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:01.5026157Z Entering 
'third_party/psimd' 2025-12-04T13:54:01.5054019Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T13:54:01.5065704Z Entering 'third_party/pthreadpool' 2025-12-04T13:54:01.5090653Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T13:54:01.5101945Z Entering 'third_party/pybind11' 2025-12-04T13:54:01.5128721Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T13:54:01.5151111Z Entering 'third_party/python-peachpy' 2025-12-04T13:54:01.5162652Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T13:54:01.5183397Z Entering 'third_party/sleef' 2025-12-04T13:54:01.5203458Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T13:54:01.5226235Z Entering 'third_party/tensorpipe' 2025-12-04T13:54:01.5243942Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T13:54:01.5253717Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T13:54:01.5283035Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T13:54:01.5291560Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T13:54:01.5310167Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T13:54:01.5322317Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T13:54:01.5341294Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T13:54:01.5351212Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T13:54:01.5370654Z 
file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T13:54:01.5390076Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T13:54:01.5414254Z file:/home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T13:54:01.5455686Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5496902Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5527653Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5566853Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5604823Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5635044Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5673984Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5710867Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5741644Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5778551Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5813315Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5851741Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5889825Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5923579Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5959949Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.5994376Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6021212Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6047029Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6075621Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6101892Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6139187Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6176354Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6202372Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6237403Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6274592Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6300530Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6335515Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6361773Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T13:54:01.6399785Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6436657Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6469301Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6504792Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6540230Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6576176Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6612249Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6639082Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6665850Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6703586Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6741792Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6780357Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6818584Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6856976Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6903624Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6931355Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6963933Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.6989308Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7023333Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7063429Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7102641Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7139193Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7173715Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7201186Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7238630Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7264313Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7302008Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 
2025-12-04T13:54:01.7326797Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7353188Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7389194Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7417997Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7456082Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7497011Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7522158Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7561490Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7588676Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7627216Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7652975Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7697315Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7723972Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7751929Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7777494Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7814013Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7849951Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7888218Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7941848Z [command]/usr/bin/git config --file 
/home/runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7958157Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.7990365Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.8022817Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.8048419Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.8075075Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.8110930Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.8150771Z [command]/usr/bin/git config --file /home/runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T13:54:01.8360489Z Cleaning up orphan processes 2025-12-04T13:54:01.8585007Z Terminate orphan process: pid (17329) (docker)